From the PISA Data Visualization Contest webpage:
PISA is a worldwide study developed by the Organisation for Economic Co-operation and Development (OECD) which examines the skills of 15-year-old school students around the world. The study assesses students’ mathematics, science, and reading skills and contains a wealth of information on students’ background, their school and the organisation of education systems. For most countries, the sample is around 5,000 students, but in some countries the number is even higher. In total, the PISA 2012 dataset contains data on 485 490 pupils.
A detailed description of the methodology of the PISA surveys can be found in the PISA 2012 Technical Report.
uses Parent Questionnaires administered to the parents of the students participating in PISA (in 11 countries for the 2012 survey);
Tests were administered in the language of instruction of mathematics.
Technical report, p.67:
Students whose language of instruction for mathematics (the major domain for 2012), was one for which no PISA assessment materials were available. Standard 2.1 of the PISA 2012 Technical Standards (see Annex F) notes that the PISA test is administered to a student in a language of instruction provided by the sampled school to that sampled student in the major domain of the test. Thus, if no test materials were available in the language in which the sampled student is taught, the student was excluded.
PISA 2012, the fifth PISA survey, covered reading, mathematics, science, problem solving and financial literacy, with a primary focus on mathematics.
It was conducted in 34 OECD countries and 31 partner countries/economies. All 65 countries/economies completed the paper-based tests, with assessments lasting a total of two hours for each student.
An additional 40 minutes were devoted to the computer-based assessments.
The full list of participants can be found here.
Whether a country took part in the additional computer-based assessments can be found in the Technical Report, pp. 23-24.
Links to two files were provided with the Udacity description of the databases for the project:
# import packages to download files and manage folders
import os
import requests
import zipfile

# create a folder and get the files
folder_name = 'PISA_data'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

urls = ['https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip',
        'https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisadict2012.csv']
for url in urls:
    response = requests.get(url)
    file_name = url.split('/')[-1]
    with open(os.path.join(folder_name, file_name), mode='wb') as file:
        file.write(response.content)

# unzip the PISA data
file_name = 'pisa2012.csv.zip'
with zipfile.ZipFile(os.path.join(folder_name, file_name)) as data_zip:
    data_zip.extractall(folder_name)
print('Unzipped')

# remove the .zip file
os.remove(os.path.join(folder_name, file_name))
print('Removed')
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# load the data and have a look
pisa_variables = pd.read_csv('PISA_data/pisadict2012.csv', encoding='latin-1', dtype='unicode')
pisa_variables.head()
pisa_variables.shape
pisa_variables.columns = ['code', 'x']
pisa_variables['x'].values
# load the PISA 2012 survey data (spoiler: it takes a lot of time on my laptop)
pisa_data = pd.read_csv('PISA_data/pisa2012.csv', encoding='latin-1', dtype='unicode')
pisa_data.head()
pisa_data.shape
The main dataset contains 485,490 rows, each one representing one student, and 636 columns with coded names.
A second file provides the dictionary for the cryptic column names: each of its rows describes one column of the main dataset.
The description given above is only a brief summary, and the dataset is truly fascinating for the possibilities of analysis it offers, even though it contains only the results of the main survey (the one used in all the countries).
Sadly, here I need to cut down on the information I handle because of scarce resources (time and computer memory). Fortunately, a detailed analysis is not required.
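One way to soften the memory problem (a sketch, not part of the original workflow; a tiny in-memory CSV stands in for the real file here): `pd.read_csv` can load only selected columns via its `usecols` parameter, which cuts memory use considerably on a file this wide.

```python
import io
import pandas as pd

# in-memory stand-in for the real pisa2012.csv; with the actual file the
# call would be the same, plus encoding='latin-1'
csv_text = "CNT,TESTLANG,PV1MATH,EXTRA\nItaly,Italian,480.0,x\n"
wanted = ['CNT', 'TESTLANG', 'PV1MATH']  # illustrative subset of interest

# only the named columns are parsed and kept in memory
light = pd.read_csv(io.StringIO(csv_text), usecols=wanted, dtype='unicode')
print(light.columns.tolist())  # the EXTRA column never gets loaded
```

`chunksize` is another `read_csv` option worth knowing, for when even the selected columns do not fit in memory at once.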
As said, PISA 2012 focused on mathematics, so the dataset also contains plausible values for subsets of mathematical competencies.
Moreover, the countries that participated in the survey differ not only in language but also in culture and writing system. I will select a subset of countries such that their writing system belongs to one of these groups:
* I wanted to distinguish between languages "that are written as they are pronounced", languages "where you cannot guess how a word is written just by hearing it", and languages that use ideograms. Wikipedia and the article Getting to the bottom of orthographic depth helped me put things down a little more precisely.
I will definitely keep the mathematics and reading scores, as well as the language(s) spoken by the students and the language in which the test was administered (TESTLANG is the column heading).
Since the mathematics scores have subcategories, it would probably be interesting to look at them separately.
(More about the scores and proficiency levels in the next cells)
There is a wealth of information about the students' background as well, and it seems like a good idea to check whether there are influential factors there, for instance the index of economic, social and cultural status (column "ESCS" in the dataset); but to really understand which of the available information should be used, I first need to trim the database and explore it better.
In the Technical Report (from p.296 on) it is possible to find the scales used to record the proficiency for the different areas:
Mathematics: Below 1, 1, 2, 3, 4, 5, 6 (highest level) *
Reading and Science: I couldn't find the scales in the document. It does indicate, however, that there is a single scale per discipline, the same as in 2009 and 2006 respectively.
All these proficiency levels come with a description of the competencies associated with them.
(Problem solving and financial literacy are also described, but the corresponding plausible values are not in the dataset.)
*NB: this is the main scale for mathematics, and also for all the subscales of the different competencies tested (e.g. "formulating situations mathematically", "employing mathematical concepts, facts, ...").
We only have PLAUSIBLE VALUES. From the Technical Report:
Plausible values
As with all item response scaling models, student proficiencies (or measures) are not observed; they are missing data that must be inferred from the observed item responses. There are several possible alternative approaches for making this inference. PISA uses the imputation methodology usually referred to as plausible values (PVs). PVs are a selection of likely proficiencies for students that attained each score. For each scale and subscale, five plausible values per student are included in the international database. Using item parameters anchored at their estimated values from the international calibration, the plausible values are random draws from the marginal posterior of the latent distribution for each student.
Sixty-five plausible values, five for each of the 13 PISA 2012 scales are included in the PISA 2012 database. PV1MATH to PV5MATH are for mathematical literacy; PV1SCIE to PV5SCIE for scientific literacy, PV1READ to PV5READ for reading literacy, PV1CPRO to PV5CPRO for computer problem solving assessment, PV1CMAT to PV5CMAT for the computer-based mathematics assessment and PV1CREA to PV5CREA for digital reading assessment. For the four mathematics content subscales, change and relationships, quantity, space and shape, uncertainty and data, the plausible values variables are PV1MACC to PV5MACC, PV1MACQ to PV5MACQ, PV1MACS to PV5MACS, and PV1MACU to PV5MACU respectively. For the three mathematics process subscales employ, formulate and interpret, the plausible values variables are PV1MAPE to PV5MAPE, PV1MAPF to PV5MAPF, and PV1MAPI to PV5MAPI respectively
Reading through the Technical Report, I've come up with this layman's explanation:
For every student, and for every scale and subscale, the test results were used to compute five plausible results that the student could have reached had they taken all the PISA tests, not only the subset contained in the booklet they tackled. This was done for accuracy when later estimating the parameters of the population (all the students of a country). Accuracy of those parameters is also the reason it would be better to use each student's weight and to repeat the calculations five times per scale (once for each plausible-value column).
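In practice this means a population statistic should be computed once per plausible-value column and the five estimates averaged, rather than relying on a single PV column. A sketch with made-up numbers (not PISA data):

```python
import numpy as np
import pandas as pd

# toy frame: five plausible values for three hypothetical students
pv = pd.DataFrame({'PV1MATH': [480.0, 510.0, 430.0],
                   'PV2MATH': [495.0, 505.0, 440.0],
                   'PV3MATH': [470.0, 520.0, 425.0],
                   'PV4MATH': [485.0, 500.0, 450.0],
                   'PV5MATH': [490.0, 515.0, 435.0]})

# compute the statistic (here, the mean) once per plausible-value column,
# then average the five estimates
estimates = [pv[col].mean() for col in pv.columns]
final_estimate = float(np.mean(estimates))
```

For a simple mean the result coincides with the grand mean of all values, but for variances and standard errors the per-column-then-average approach is the one the PISA methodology prescribes.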
The plausible score is given on a scale that can be converted to the corresponding proficiency level according to the bands provided in the Technical Report (listed in the "Test values conversion bands" section below).
Every row/student has a weight (more than one, to be precise, but..).
If I understood correctly, the weights stem from the sampling process of subregions, schools and students done in every country, and the reason to use them is, in the end, to better represent the country-level parameters.
I'm planning to regroup the students by different parameters (primary first language = language at school = language of the PISA test), so they won't be divided by country and won't represent country-level proficiency.
Therefore I think it would be wrong to use the weights within this project.
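For the record, this is what using the weights would look like, in a toy sketch with made-up scores and weights (PISA stores the final student weight in a column named W_FSTUWT, mimicked here):

```python
import numpy as np
import pandas as pd

# made-up data: the third student represents twice as many peers
toy = pd.DataFrame({'score': [400.0, 500.0, 600.0],
                    'W_FSTUWT': [1.0, 1.0, 2.0]})

unweighted_mean = toy['score'].mean()                                   # 500.0
weighted_mean = float(np.average(toy['score'], weights=toy['W_FSTUWT']))  # 525.0
```

The third student counts double, pulling the weighted mean up; this is exactly the country-level correction that my language-based regrouping does not need.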
# find the list of countries contained
pisa_data.CNT.unique()
By the way, these are the full names, not the 3-character country codes indicated by the data dictionary, which is good: it is easier for me to check the spoken languages on the Wikipedia pages for the countries I have doubts about.
I'm a bit puzzled by the presence of the United States of America and, separately, Florida, Connecticut and Massachusetts. Assuming that all the data included in the dataset are valid (the survey data have been revised several times, if we believe the Technical Report, and we have to), the reason is probably to be found in the sampling choices and in schools volunteering to conduct the survey. Provided there are no duplicates, it is not a concern for me.
I will keep these 37 Countries/economies: 'United Arab Emirates', 'Argentina', 'Australia', 'Austria', 'Belgium', 'Canada', 'Switzerland', 'Chile', 'Colombia', 'Costa Rica', 'Germany', 'Spain', 'Finland', 'France', 'United Kingdom', 'Hong Kong-China', 'Ireland', 'Italy', 'Japan', 'Korea', 'Liechtenstein', 'Luxembourg', 'Macao-China', 'Mexico', 'New Zealand', 'Peru', 'Qatar', 'China-Shanghai', 'Florida (USA)', 'Connecticut (USA)', 'Massachusetts (USA)', 'Singapore', 'Chinese Taipei', 'Tunisia', 'Uruguay', 'United States of America', 'Vietnam'
# list of Countries to keep
to_keep = ['United Arab Emirates', 'Argentina', 'Australia', 'Austria', 'Belgium', 'Canada', 'Switzerland', 'Chile', 'Colombia', 'Costa Rica', 'Germany', 'Spain', 'Finland', 'France', 'United Kingdom', 'Hong Kong-China', 'Ireland', 'Italy', 'Japan', 'Korea', 'Liechtenstein', 'Luxembourg', 'Macao-China', 'Mexico', 'New Zealand', 'Peru', 'Qatar', 'China-Shanghai', 'Florida (USA)', 'Connecticut (USA)', 'Massachusetts (USA)', 'Singapore', 'Chinese Taipei', 'Tunisia', 'Uruguay', 'United States of America', 'Vietnam']
# select only the desired Countries:
# reusing the df name in order not to use up too much memory
pisa_data = pisa_data.loc[pisa_data.CNT.isin(to_keep)]
# clean memory from trial results (answer "y" to prompt)
%reset Out
pisa_data.shape
We had 485,490 rows; now there are 314,831 (we dismissed roughly 35% of them).
Time to eliminate some of the columns. I'm not ready to dismiss most of them yet, even though in the end I will use only a few of the 636 in the dataset.
I'm quite confident in dismissing the following two chunks, though (without bothering to pick out single columns):
from : EC04Q01A Acquired skills - Find job info - Yes, at school
to: EC04Q06C Acquired skills - Student financing - No, never
from: W_FSTR1 FINAL STUDENT REPLICATE BRR-FAY WEIGHT1
to: VAR_UNIT RANDOMLY ASSIGNED VARIANCE UNIT
# delete the columns listed above
pisa_data.drop(pisa_data.loc[:, 'EC04Q01A':'EC04Q06C'].columns, axis=1, inplace=True)
pisa_data.drop(pisa_data.loc[:, 'W_FSTR1':'VAR_UNIT'].columns, axis=1, inplace=True)
# since I'm doing this, I'll drop also the first, unnamed, column
pisa_data.drop(pisa_data.iloc[:,0:1].columns, axis=1, inplace=True)
# looking a bit better?
pisa_data.shape
From 636 columns to 535: probably still 510-520 useless columns, but 101 columns better.
I'll save my subset of the database as it is now, so that if needed, I can restart running the cells from the last one before the next section, "Select variables".
pisa_data.dtypes
# save the dataset, ',' as separator, 'latin-1' as encoding (matching the source file)
pisa_data.to_csv('PISA_data/selected_countries_subset.csv', index=False, encoding='latin-1')
print('Saved')
# clean a bit of memory
del pisa_data
%reset Out
# in the case I need to restart the kernel or the computer
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# reload the dictionary
pisa_variables = pd.read_csv('PISA_data/pisadict2012.csv', encoding='latin-1', dtype='unicode')
pisa_variables.columns = ['code', 'x']
# load my selected Countries data
pisa_subset = pd.read_csv('PISA_data/selected_countries_subset.csv', encoding='latin-1', dtype='unicode')
#have a look
pisa_subset.head()
pisa_subset.shape
pisa_subset.CNT.count()
Descriptions from pisa_variables that look interesting:
descriptions = ['Country code 3-character',
'OECD country',
'Student ID',
'Gender',
'Mother<Highest Schooling>',
'Mother Current Job Status',
'Father<Highest Schooling>',
'Father Current Job Status',
'Highest educational level of parents',
'Country of Birth International - Self',
'First language learned',
'Age started learning <test language>',
'Language of the test',
'Language spoken - Mother',
'Language spoken - Father',
'Standard or simplified set of booklets',
'Attitude towards School: Learning Outcomes',
'Attitude towards School: Learning Activities',
'Sense of Belonging to School',
"Mathematics Teacher's Classroom Management",
'Cognitive Activation in Mathematics Lessons',
'Instrumental Motivation for Mathematics',
'Mathematics Interest',
'Mathematics Work Ethic',
"Mathematics Teacher's Support",
'Mathematics Self-Concept',
'Teacher Student Relations',
'Subjective Norms in Mathematics',
'Index of economic, social and cultural status',
'Home educational resources',
'Preference for Heritage Language in Conversations with Family and Friends',
'Language at home (3-digit code)',
'International Language at Home',
'Preference for Heritage Language in Language Reception and Production',
'Plausible value 1 in mathematics',
'Plausible value 2 in mathematics',
'Plausible value 3 in mathematics',
'Plausible value 4 in mathematics',
'Plausible value 5 in mathematics',
'Plausible value 1 in content subscale of math - Change and Relationships',
'Plausible value 2 in content subscale of math - Change and Relationships',
'Plausible value 3 in content subscale of math - Change and Relationships',
'Plausible value 4 in content subscale of math - Change and Relationships',
'Plausible value 5 in content subscale of math - Change and Relationships',
'Plausible value 1 in content subscale of math - Quantity',
'Plausible value 2 in content subscale of math - Quantity',
'Plausible value 3 in content subscale of math - Quantity',
'Plausible value 4 in content subscale of math - Quantity',
'Plausible value 5 in content subscale of math - Quantity',
'Plausible value 1 in content subscale of math - Space and Shape',
'Plausible value 2 in content subscale of math - Space and Shape',
'Plausible value 3 in content subscale of math - Space and Shape',
'Plausible value 4 in content subscale of math - Space and Shape',
'Plausible value 5 in content subscale of math - Space and Shape',
'Plausible value 1 in content subscale of math - Uncertainty and Data',
'Plausible value 2 in content subscale of math - Uncertainty and Data',
'Plausible value 3 in content subscale of math - Uncertainty and Data',
'Plausible value 4 in content subscale of math - Uncertainty and Data',
'Plausible value 5 in content subscale of math - Uncertainty and Data',
'Plausible value 1 in process subscale of math - Employ',
'Plausible value 2 in process subscale of math - Employ',
'Plausible value 3 in process subscale of math - Employ',
'Plausible value 4 in process subscale of math - Employ',
'Plausible value 5 in process subscale of math - Employ',
'Plausible value 1 in process subscale of math - Formulate',
'Plausible value 2 in process subscale of math - Formulate',
'Plausible value 3 in process subscale of math - Formulate',
'Plausible value 4 in process subscale of math - Formulate',
'Plausible value 5 in process subscale of math - Formulate',
'Plausible value 1 in process subscale of math - Interpret',
'Plausible value 2 in process subscale of math - Interpret',
'Plausible value 3 in process subscale of math - Interpret',
'Plausible value 4 in process subscale of math - Interpret',
'Plausible value 5 in process subscale of math - Interpret',
'Plausible value 1 in reading',
'Plausible value 2 in reading',
'Plausible value 3 in reading',
'Plausible value 4 in reading',
'Plausible value 5 in reading',
'Plausible value 1 in science',
'Plausible value 2 in science',
'Plausible value 3 in science',
'Plausible value 4 in science',
'Plausible value 5 in science']
#get the column names
columns_to_keep = pisa_variables[pisa_variables['x'].isin(descriptions)]['code'].tolist()
pisa_chosen_var = pisa_subset[columns_to_keep]
pisa_chosen_var
pisa_chosen_var.shape
pisa_chosen_var.info()
The missing values are not data I can retrieve elsewhere; therefore, after a bit of cleaning, I will assess the columns I want to use and decide what to do.
The dataset looks OK. I still have more variables than I will use; I will cut them later, after exploring a little.
Test values conversion bands
Main mathematical literacy levels (and subscales as well)
Level : Score points on the PISA scale
- 6 : value >= 669.3
- 5 : 607.0 <= value < 669.3
- 4 : 544.7 <= value < 607.0
- 3 : 482.4 <= value < 544.7
- 2 : 420.1 <= value < 482.4
- 1 : 357.8 <= value < 420.1
- Below 1 : value < 357.8
Reading literacy performance band definitions on the PISA scale
Level : Score points on the PISA scale
- 6 : value > 698.32
- 5 : 625.61 < value <= 698.32
- 4 : 552.89 < value <= 625.61
- 3 : 480.18 < value <= 552.89
- 2 : 407.47 < value <= 480.18
- 1a : 334.75 < value <= 407.47
- 1b : 262.04 < value <= 334.75
Scientific literacy performance band definitions on the PISA scale
Level : Score points on the PISA scale
- 6 : value > 707.9
- 5 : 633.3 < value <= 707.9
- 4 : 558.7 < value <= 633.3
- 3 : 484.1 < value <= 558.7
- 2 : 409.5 < value <= 484.1
- 1 : 334.9 < value <= 409.5 (for this scale there was no indication of which side of the bounds was included; I decided to align it with the reading scale)
# make a copy
df = pisa_chosen_var.copy() # sorry, but df is a convenient name
print('Done!')
The column names (from Excel) are:
mathematics
'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1MACC', 'PV2MACC', 'PV3MACC', 'PV4MACC', 'PV5MACC', 'PV1MACQ', 'PV2MACQ', 'PV3MACQ', 'PV4MACQ', 'PV5MACQ', 'PV1MACS', 'PV2MACS', 'PV3MACS', 'PV4MACS', 'PV5MACS', 'PV1MACU', 'PV2MACU', 'PV3MACU', 'PV4MACU', 'PV5MACU', 'PV1MAPE', 'PV2MAPE', 'PV3MAPE', 'PV4MAPE', 'PV5MAPE', 'PV1MAPF', 'PV2MAPF', 'PV3MAPF', 'PV4MAPF', 'PV5MAPF', 'PV1MAPI', 'PV2MAPI', 'PV3MAPI', 'PV4MAPI', 'PV5MAPI'
reading
'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ'
science
'PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'
# collect the column names:
mathematics = ['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH',
'PV1MACC', 'PV2MACC', 'PV3MACC', 'PV4MACC', 'PV5MACC',
'PV1MACQ', 'PV2MACQ', 'PV3MACQ', 'PV4MACQ', 'PV5MACQ',
'PV1MACS', 'PV2MACS', 'PV3MACS', 'PV4MACS', 'PV5MACS',
'PV1MACU', 'PV2MACU', 'PV3MACU', 'PV4MACU', 'PV5MACU',
'PV1MAPE', 'PV2MAPE', 'PV3MAPE', 'PV4MAPE', 'PV5MAPE',
'PV1MAPF', 'PV2MAPF', 'PV3MAPF', 'PV4MAPF', 'PV5MAPF',
'PV1MAPI', 'PV2MAPI', 'PV3MAPI', 'PV4MAPI', 'PV5MAPI']
reading = ['PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 'PV5READ']
science = ['PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE']
# convert the columns from strings to float
all_PV = mathematics + reading + science
for col in all_PV:
    df[col] = df[col].astype('float')
print('Done!')
# verify dtype
df.info()
# convert the values for mathematics and turn into categories
# define conversion function
def conversion_math(x):
    if pd.isna(x):        # NaN never compares equal to np.nan, so test with pd.isna
        return x
    if x >= 669.3:
        return '6'
    if 607.0 <= x < 669.3:
        return '5'
    if 544.7 <= x < 607.0:
        return '4'
    if 482.4 <= x < 544.7:
        return '3'
    if 420.1 <= x < 482.4:
        return '2'
    if 357.8 <= x < 420.1:
        return '1'
    return 'Below 1'      # x < 357.8
# create the categories
values_math = ['Below 1', '1', '2', '3', '4', '5', '6']
ordered_math = pd.api.types.CategoricalDtype(ordered=True, categories=values_math)
print('Working on this..')
# apply
for col in mathematics:
    df[col] = df[col].apply(conversion_math)
    df[col] = df[col].astype(ordered_math)
print('Done!')
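As a side note, the same banding can be expressed without a Python-level function call per value: `pd.cut` with left-closed bins reproduces the mathematics levels in one vectorised call (a sketch on made-up scores, shown for comparison rather than as a replacement):

```python
import numpy as np
import pandas as pd

# bin edges matching the mathematics bands; right=False makes the intervals
# left-closed, i.e. [607.0, 669.3) maps to level '5', as in the table above
bins = [-np.inf, 357.8, 420.1, 482.4, 544.7, 607.0, 669.3, np.inf]
labels = ['Below 1', '1', '2', '3', '4', '5', '6']

scores = pd.Series([300.0, 400.0, 650.0, np.nan])  # made-up scores
levels = pd.cut(scores, bins=bins, labels=labels, right=False)
```

`pd.cut` passes NaN through unchanged and returns an ordered categorical directly, so it also makes the separate `astype(ordered_math)` step unnecessary.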
# convert the values for reading and turn into categories
# define conversion function
def conversion_reading(x):
    if pd.isna(x):        # NaN never compares equal to np.nan, so test with pd.isna
        return x
    if x > 698.32:
        return '6'
    if 625.61 < x <= 698.32:
        return '5'
    if 552.89 < x <= 625.61:
        return '4'
    if 480.18 < x <= 552.89:
        return '3'
    if 407.47 < x <= 480.18:
        return '2'
    if 334.75 < x <= 407.47:
        return '1a'
    if 262.04 < x <= 334.75:
        return '1b'
    return 'Too low'      # x <= 262.04; my addition, to not leave values uncategorized
# create the categories
values_reading = ['Too low', '1b', '1a', '2', '3', '4', '5', '6']
ordered_reading = pd.api.types.CategoricalDtype(ordered=True, categories=values_reading)
print('Working on this..')
# apply
for col in reading:
    df[col] = df[col].apply(conversion_reading)
    df[col] = df[col].astype(ordered_reading)
print('Done!')
# convert the values for science and turn into categories
# define conversion function
def conversion_science(x):
    if pd.isna(x):        # NaN never compares equal to np.nan, so test with pd.isna
        return x
    if x > 707.9:
        return '6'
    if 633.3 < x <= 707.9:
        return '5'
    if 558.7 < x <= 633.3:
        return '4'
    if 484.1 < x <= 558.7:
        return '3'
    if 409.5 < x <= 484.1:
        return '2'
    if 334.9 < x <= 409.5:
        return '1'
    return 'Too low'      # x <= 334.9; my addition, to not leave values uncategorized
# create the categories
values_science = ['Too low', '1', '2', '3', '4', '5', '6']
ordered_science = pd.api.types.CategoricalDtype(ordered=True, categories=values_science)
print('Working on this..')
# apply
for col in science:
    df[col] = df[col].apply(conversion_science)
    df[col] = df[col].astype(ordered_science)
print('Done!')
df[['PV3MATH', 'PV1READ', 'PV4SCIE']]
df.info()
# all my chosen column descriptions are in the list 'descriptions'; the codes are in columns_to_keep
# unfortunately I cannot zip them, because the order is not the same
# create a subset of pisa_variables
my_columns = pisa_variables[pisa_variables['code'].isin(columns_to_keep)].copy()
# clean the descriptions a little bit
# note: the '_.3_digit_code.' pattern uses the dots as regex wildcards to match
# the parentheses, so it needs regex=True in recent pandas versions
my_columns['x'] = (my_columns.x.str.replace(' - ', '_').str.replace(' ', '_')
                   .str.replace("'s", '').str.replace('-', '_')
                   .str.replace('<', '_').str.replace('>', '')
                   .str.replace(',', '').str.replace('__', '_'))
my_columns['x'] = (my_columns.x.str.replace('_value_', '_level_')
                   .str.replace('content_subscale_of_', '')
                   .str.replace('process_subscale_of_', '')
                   .str.replace('_code_3_character', '')
                   .str.replace('_.3_digit_code.', '', regex=True))
# create a dictionary of the current column names and the cleaned descriptions, set the latter as new names
columns_dict = dict(zip(my_columns['code'], my_columns['x']))
df.rename(columns=columns_dict, inplace=True)
# check up
df.head()
# parents schooling
df.Mother_Highest_Schooling.unique()
df.Father_Highest_Schooling.unique()
df.Highest_educational_level_of_parents.unique()
From the Technical Report, p. 307:
Indices on parental education were constructed by recoding educational qualifications into the following categories: (0) None, (1) ISCED 1 (primary education), (2) ISCED 2 (lower secondary), (3) ISCED Level 3B or 3C (vocational/pre-vocational upper secondary), (4) ISCED 3A (general upper secondary) and/or ISCED 4 (non-tertiary post-secondary), (5) ISCED 5B (vocational tertiary) and (6) ISCED 5A, 6 (theoretically oriented tertiary and post-graduate).
These three columns need to be converted to ordered categories.
To give all three the same category labels, I will first harmonise the values in the Mother and Father columns, and then in the parents' highest-education column:
df.Mother_Highest_Schooling = df.Mother_Highest_Schooling.str.replace("She did not complete <ISCED level 1>", "None").str.replace('<', '').str.replace('>', '').str.replace(' level ', ' ').str.replace('ISCED 3A', 'ISCED 3A, 4').str.strip()
df.Father_Highest_Schooling = df.Father_Highest_Schooling.str.replace("He did not complete <ISCED level 1>", "None").str.replace('<', '').str.replace('>', '').str.replace(' level ', ' ').str.replace('ISCED 3A', 'ISCED 3A, 4').str.strip()
df.Highest_educational_level_of_parents = df.Highest_educational_level_of_parents.str.replace('ISCED 3B, C', 'ISCED 3B, 3C').str.replace('ISCED 3A, ISCED 4', 'ISCED 3A, 4').str.strip()
# prepare the category
education_values = ['None', 'ISCED 1', 'ISCED 2', 'ISCED 3B, 3C', 'ISCED 3A, 4', 'ISCED 5B', 'ISCED 5A, 6']
ordered_ed_values = pd.api.types.CategoricalDtype(ordered=True, categories=education_values)
# apply them
df.Mother_Highest_Schooling = df.Mother_Highest_Schooling.astype(ordered_ed_values)
df.Father_Highest_Schooling = df.Father_Highest_Schooling.astype(ordered_ed_values)
df.Highest_educational_level_of_parents = df.Highest_educational_level_of_parents.astype(ordered_ed_values)
df.iloc[:, 7:12].head()
df.Father_Current_Job_Status.unique()
df.International_Language_at_Home.unique()
df.First_language_learned.unique()
df.Age_started_learning_test_language.unique()
# the last one can become an ordered category
learning_test_language = ['0 to 3 years', '4 to 6 years', '7 to 9 years', '10 to 12 years', '13 years or older']
ordered_learning_lang = pd.api.types.CategoricalDtype(ordered=True, categories=learning_test_language)
df.Age_started_learning_test_language = df.Age_started_learning_test_language.astype(ordered_learning_lang)
Of this group, df.iloc[:, 7:12], the Age_started_learning_test_language column is the only one that needed to be converted into a category. The other variables don't have an order and have few values, so they can remain strings.
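Why the ordered dtype matters: comparisons and sorting then follow the band order rather than alphabetical order (where "10 to 12 years" would sort before "4 to 6 years"). A self-contained toy example mirroring the age bands above:

```python
import pandas as pd

# the same bands as in the dataset, in their natural order
bands = ['0 to 3 years', '4 to 6 years', '7 to 9 years',
         '10 to 12 years', '13 years or older']
ordered_bands = pd.api.types.CategoricalDtype(ordered=True, categories=bands)

s = pd.Series(['13 years or older', '0 to 3 years', '7 to 9 years'],
              dtype=ordered_bands)

# elementwise comparison against a band label respects the declared order
started_early = s < '10 to 12 years'
```

This is also why the misordered list (with '10 to 12 years' before '7 to 9 years') would have been a silent bug: every comparison and sort would have used the wrong order.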
df.iloc[:, 12:17].head()
df.Language_spoken_Mother.unique()
df.Language_spoken_Mother.isna().sum()
df.Language_spoken_Father.unique()
df.Standard_or_simplified_set_of_booklets.unique()
df['Attitude_towards_School:_Learning_Outcomes'].describe()
df['Attitude_towards_School:_Learning_Activities'].describe()
# These last two columns need to be converted to float
df['Attitude_towards_School:_Learning_Outcomes'] = df['Attitude_towards_School:_Learning_Outcomes'].astype(float)
df['Attitude_towards_School:_Learning_Activities'] = df['Attitude_towards_School:_Learning_Activities'].astype(float)
df[['Attitude_towards_School:_Learning_Outcomes', 'Attitude_towards_School:_Learning_Activities']].describe()
In the group df.iloc[:, 12:17] only the last two columns needed conversion: from string to float. The language spoken at home could become an ordered category, but I will decide about it later.
# let's go on
df.iloc[:, 17:22]
These can all be converted to float.
df['Sense_of_Belonging_to_School'] = df['Sense_of_Belonging_to_School'].astype(float)
df['Mathematics_Teacher_Classroom_Management'] = df['Mathematics_Teacher_Classroom_Management'].astype(float)
df['Cognitive_Activation_in_Mathematics_Lessons'] = df['Cognitive_Activation_in_Mathematics_Lessons'].astype(float)
df['Index_of_economic_social_and_cultural_status'] = df['Index_of_economic_social_and_cultural_status'].astype(float)
df['Home_educational_resources'] = df['Home_educational_resources'].astype(float)
df.iloc[:, 17:22].describe()
df.iloc[:, 22:27]
# convert column 23 and 24 to float
df['Instrumental_Motivation_for_Mathematics'] = df['Instrumental_Motivation_for_Mathematics'].astype(float)
df['Mathematics_Interest'] = df['Mathematics_Interest'].astype(float)
df.iloc[:, 23:25].describe()
df.Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends.unique()
# let's deal with this column together
df.Preference_for_Heritage_Language_in_Language_Reception_and_Production.unique()
# convert the Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends to category
language_preference = ['0', '1', '2', '3', '4', '5']
ordered_lang_preference = pd.api.types.CategoricalDtype(ordered=True, categories=language_preference)
df.Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends = df.Preference_for_Heritage_Language_in_Conversations_with_Family_and_Friends.astype(ordered_lang_preference)
df.Preference_for_Heritage_Language_in_Language_Reception_and_Production = df.Preference_for_Heritage_Language_in_Language_Reception_and_Production.astype(ordered_lang_preference)
df.Preference_for_Heritage_Language_in_Language_Reception_and_Production.unique()
df['Language_at_home'].unique()
# clean the trailing whitespaces
df.Language_at_home = df.Language_at_home.str.strip()
df.iloc[:, 27:32]
# first one has already been done,
# other 4 columns to float
df.iloc[:,28:32] = df.iloc[:, 28:32].astype(float)
df.iloc[:,28:32].describe()
df.iloc[:,32:37]
# first column to float
df.Subjective_Norms_in_Mathematics = df.Subjective_Norms_in_Mathematics.astype(float)
df.Subjective_Norms_in_Mathematics.describe()
# Language_of_test: check the values
df.Language_of_the_test.unique()
Language of the test is OK as a string, but I can drop the rows where the test is not in one of my chosen languages. The groups are:
shallow_ortography: Spanish, Finnish, Italian, German
deep_ortography: English, French, Arabic
logographic: Chinese, Japanese, Korean. To these I will add Shanghai dialect, Mandarin and Cantonese, because they are all written with Chinese characters.
Since English and Arabic belong to the same group, I will keep the "Hybrid - English + Arabic (QAT)" group, renaming it 'English_Arabic'.
shallow_ortography =['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']
test_lang_to_keep = shallow_ortography + deep_ortography + logographic
# clean the test language labels
df.Language_of_the_test = df.Language_of_the_test.str.strip().str.replace(r'Hybrid.*', 'English_Arabic', regex=True)
# check values
df.Language_of_the_test.unique()
# drop all rows where the test is not in test_lang_to_keep
df = df.query('Language_of_the_test in @test_lang_to_keep')
# just to be sure that only plausible levels are left
df.iloc[:,34:].columns
df.info()
Here we go! It looks good enough to me to start some exploration.
I need to know a bit more about the dataset to decide:
Since I will surely modify the dataset, I'll go with a copy.
# copy the df
clean_df = df.copy()
First, I'll have a look at the different plausible levels
# do the value change a lot among columns of the same scale?
clean_df.iloc[:,34:].head(20)
# compare the five plausible-value columns of each scale side by side:
# mathematics, the first mathematics subscale, reading and science
pv_blocks = [(34, 0), (39, 1), (74, 2), (79, 3)]   # (first column index, palette color)
for start, color_idx in pv_blocks:
    plt.figure(figsize=(20, 6))
    for i in range(5):
        plt.subplot(1, 5, i + 1)
        sb.countplot(x=clean_df.iloc[:, start + i], color=sb.color_palette()[color_idx]);
They all look very much alike within each scale and subscale, so I'll keep just the first column of every scale/subscale.
# drop all the plausible levels 2 to 5
clean_df = clean_df.filter(regex=r'(^(?!Plausible_level_[2-5]_).*)')
clean_df.shape
clean_df.info()
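As a sanity check on the negative-lookahead pattern used in the filter above, the same regex can be tried on a toy frame (the column names here are invented for illustration):

```python
import pandas as pd

# toy frame with column names mimicking the real ones (invented here)
toy = pd.DataFrame(columns=['Country',
                            'Plausible_level_1_in_mathematics',
                            'Plausible_level_2_in_mathematics',
                            'Plausible_level_5_in_reading'])

# same pattern as above: keep only names NOT starting with Plausible_level_2..5
kept = toy.filter(regex=r'(^(?!Plausible_level_[2-5]_).*)').columns.tolist()
print(kept)  # ['Country', 'Plausible_level_1_in_mathematics']
```

Because `filter(regex=...)` uses `re.search` and the pattern is anchored with `^`, a name either passes the lookahead at position 0 or is dropped entirely.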
Looking at the output above, some columns are complete (298594 non-null), while all the others are incomplete.
Plausible level in reading contains only 233762 non-null values, and it is a variable I definitely plan to use. Other possible limitations come from the remaining incomplete columns.
Moreover, the math subscales are not complete either (could this depend on lower-scoring students not completing all test items? Or on the booklet rotation system? I could not find the answer in the Technical Report).
Since we have already seen them, let's plot the math scales again, this time in a grid.
# math main scale and subscales
# create the grid
fig, axes = plt.subplots(2,4, figsize=(20,10))
axes = axes.flatten()
for i in range(8):
plt.sca(axes[i])
col = 34 + i # column to plot
sb.countplot(clean_df.iloc[:,col], color=sb.color_palette()[0])
plt.ylim(0, 70000) # limit of the main scale (it is the largest)
The distribution of the students on the math scale and subscales is unimodal and slightly right-skewed, with the majority of students in levels 2 and 3, followed by 1 and below_1, and lastly 4, 5 and 6.
The distribution of the Uncertainty_and_Data subscale is almost identical to that of the main mathematical scale (first plot). The other subscales vary to different degrees.
It makes sense that there are not many students at level "6", since, at 15, math programmes have surely not been completed. Hopefully the "Too low" scores are due to problems with the language (maybe students who moved from a different language area).
# reading and science NB: plotted together just for convenience
plt.figure(figsize=(10,6))
plt.subplot(1,2,1)
sb.countplot(clean_df.iloc[:,42], color=sb.color_palette()[2]);
plt.subplot(1,2,2)
sb.countplot(clean_df.iloc[:,43], color=sb.color_palette()[3]);
SCIENCE The science distribution looks more "normal", but the "Too low" column count is significant, so I would say it is still slightly right-skewed, though less so than the main math scale. As with mathematics, it makes sense that there are not many students at level "6", since, at 15, science programmes have surely not been completed. And again, hopefully the "Too low" scores are due to problems with the language.
READING Finally, this distribution looks normal. If people have access to schooling of any type, reading is surely one of the key subjects, and the skill is then applied to the study of almost all other topics.
Let's have a look at other variables, the ones with less non-null values.
Categories (some ordered, others not):
Country, OECD_country, Language_at_home, International_Language_at_Home, Language_of_the_test, Country_of_Birth_International_Self, Gender, Standard_or_simplified_set_of_booklets, Highest_educational_level_of_parents, Mother_Highest_Schooling, Mother_Current_Job_Status, Father_Highest_Schooling, Father_Current_Job_Status
Numerical:
Attitude_towards_School:_Learning_Outcomes, Attitude_towards_School:_Learning_Activities, Sense_of_Belonging_to_School, Mathematics_Teacher_Classroom_Management, Cognitive_Activation_in_Mathematics_Lessons, Index_of_economic_social_and_cultural_status, Home_educational_resources, Instrumental_Motivation_for_Mathematics, Mathematics_Interest, Mathematics_Work_Ethic, Mathematics_Teacher_Support, Mathematics_Self_Concept, Teacher_Student_Relations, Subjective_Norms_in_Mathematics
CATEGORICAL variables
# CATEGORIES (some ordered, others not):
categories_to_plot = ['Country', 'OECD_country', 'Country_of_Birth_International_Self', 'Language_of_the_test', 'Language_at_home', 'International_Language_at_Home', 'Gender', 'Standard_or_simplified_set_of_booklets', 'Highest_educational_level_of_parents', 'Mother_Highest_Schooling', 'Mother_Current_Job_Status', 'Father_Highest_Schooling', 'Father_Current_Job_Status']
# create the grid
fig, axes = plt.subplots(7,2, figsize=(20,40))
axes = axes.flatten()
color_range = [0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9]
for i in range(13):
plt.sca(axes[i])
col = categories_to_plot[i] # column to plot
sb.countplot(data=clean_df, y=col, color=sb.color_palette()[color_range[i]])
plt.xticks(rotation=20)
The highest schooling level most often reported for both father and mother (respectively 271,374 and 280,683 answers) is "ISCED 3A, 4", while for the highest level between the two parents, "ISCED 5B" and "ISCED 5A, 6" together account for around 150k of the 290,593 available data points.
FROM the Technical report, p. 307:
Students’ responses regarding parental education were classified using ISCED (OECD, 1999). Indices on parental education were constructed by recoding educational qualifications into the following categories: (0) None, (1) ISCED 1 (primary education), (2) ISCED 2 (lower secondary), (3) ISCED Level 3B or 3C (vocational/pre-vocational upper secondary), (4) ISCED 3A (general upper secondary) and/or ISCED 4 (non-tertiary post-secondary), (5) ISCED 5B (vocational tertiary) and (6) ISCED 5A, 6 (theoretically oriented tertiary and post-graduate). Indices with these categories were provided for the students’ mother (MISCED) and the students’ father (FISCED). In addition, the index of highest educational level of parents (HISCED) corresponds to the higher ISCED level of either parent.
The variables described (MISCED, FISCED and HISCED) are the ones contained in the database. The survey collected the same information with the parent questionnaire (not administered in every country), but that data is reported in another variable (PQHISCED) not included in the PISA2012 dataset, and, from the description above, it looks like it has not been used to adjust the HISCED values.
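The recoding described in the report maps naturally onto an ordered categorical in pandas. A minimal sketch (the abbreviated labels and variable names below are my own, not the dataset's):

```python
import pandas as pd

# ordered ISCED categories, following the recoding in the Technical Report
isced = pd.CategoricalDtype(categories=['None', 'ISCED 1', 'ISCED 2',
                                        'ISCED 3B, C', 'ISCED 3A, 4',
                                        'ISCED 5B', 'ISCED 5A, 6'],
                            ordered=True)

mother = pd.Series(['ISCED 2', 'ISCED 5B'], dtype=isced)      # MISCED-like
father = pd.Series(['ISCED 3A, 4', 'ISCED 1'], dtype=isced)   # FISCED-like

# HISCED-like: the higher ISCED level of either parent, via the category codes
codes = pd.concat([mother.cat.codes, father.cat.codes], axis=1).max(axis=1)
highest = pd.Series(pd.Categorical.from_codes(codes, dtype=isced))
print(highest.tolist())  # ['ISCED 3A, 4', 'ISCED 5B']
```

Working on the integer codes keeps the "higher of the two" comparison safe even though the labels are strings.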
Most fathers work full-time, around 30k work part-time and about as many are at home (whether looking for a job or not). Most mothers work as well, full- or part-time, but a larger proportion are at home.
The standard set of booklets is the majority, but there are still more than 80k simplified ones.
The students in the selected dataset are about 50% male and 50% female.
Let's regroup the languages into my three main categories, already summarized in these variables:
shallow_ortography =['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']
clean_df.Language_of_the_test.unique()
# create a column "Language_type" and populate it on the base of the language_of_the_test variable
clean_df.loc[clean_df.Language_of_the_test.isin(shallow_ortography), 'Language_type'] = 'shallow orthography'
clean_df.loc[clean_df.Language_of_the_test.isin(deep_ortography), 'Language_type'] = 'deep orthography'
clean_df.loc[clean_df.Language_of_the_test.isin(logographic), 'Language_type'] = 'logographic'
# check my new column
clean_df.Language_type.value_counts()
# plot it
labels = []
for a, b in zip(clean_df.Language_type.value_counts().index, clean_df.Language_type.value_counts().values/clean_df.Language_type.count()*100):
labels.append(a + '\n' + '{:.2f}'.format(b) + '%')
plt.pie(clean_df.Language_type.value_counts(), labels=labels, startangle=-94, counterclock=True);
plt.axis('square')
Logographic languages account for a much smaller proportion of the overall data than the other two language groups.
# numeric variables to plot
num_var_to_plot = ['Attitude_towards_School:_Learning_Outcomes', 'Attitude_towards_School:_Learning_Activities', 'Sense_of_Belonging_to_School', 'Mathematics_Teacher_Classroom_Management', 'Cognitive_Activation_in_Mathematics_Lessons', 'Index_of_economic_social_and_cultural_status', 'Home_educational_resources', 'Instrumental_Motivation_for_Mathematics', 'Mathematics_Interest', 'Mathematics_Work_Ethic', 'Mathematics_Teacher_Support', 'Mathematics_Self_Concept', 'Teacher_Student_Relations', 'Subjective_Norms_in_Mathematics']
# create the grid
fig, axes = plt.subplots(7,2, figsize=(20,40))
axes = axes.flatten()
color_range = [0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9]
for i in range(14):
plt.sca(axes[i])
col = num_var_to_plot[i] # column to plot
plt.hist(data=clean_df, x=col, color=sb.color_palette()[color_range[i]])
plt.title(col)
(warning above because of NaNs)
Well, it was good to see them all, but the variables that I really want here are
From the Technical report, p.353:
[...] the ESCS in PISA 2012 consisted of three sub-components, the highest parental occupation (HISEI), the highest parental education expressed as years of schooling (PARED) and the index of home possessions (HOMEPOS) which comprised all items on the WEALTH, CULTPOS and HEDRES scales, as well as books in the home (ST28Q01) [...]
This index has been weighted in order to be comparable across countries. Of course it cannot be perfect, because of the great socioeconomic and cultural differences among countries, but it is my best bet for comparing students' performances while trying to minimise the differences coming from factors external to school.
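To make the composition concrete, here is a conceptual sketch only: the OECD derives ESCS from the three sub-components with a principal-component procedure and country weighting, whereas below the toy sub-components are simply z-scored and averaged to illustrate the idea (all data invented):

```python
import numpy as np

rng = np.random.default_rng(42)
hisei = rng.normal(50, 15, 1000)    # highest parental occupation (toy data)
pared = rng.normal(12, 3, 1000)     # parental education in years (toy data)
homepos = rng.normal(0, 1, 1000)    # home possessions index (toy data)

def zscore(x):
    """Standardise so the components are on a comparable scale."""
    return (x - x.mean()) / x.std()

# naive composite: equal-weight average of the standardised components
escs_sketch = (zscore(hisei) + zscore(pared) + zscore(homepos)) / 3
```

This is not the OECD procedure, just a way to see why the resulting index is centred near 0 with most students within a few units of it.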
From the Technical report, p.335:
Four items regarding attitude towards school in terms of learning activities (ATTLNACT) were included in the Student Questionnaire with four response categories from "Strongly agree" to "Strongly disagree". [...] a) Trying hard at school will help me get a good job b) Trying hard at school will help me get into a good [...] c) I enjoy receiving good [...]
d) Trying hard at school is important
I chose this variable, instead of "Learning Outcomes", because the consistency of its scale is described as high in the Technical Report, while it is only moderate-to-high for the other.
From the Technical report, p.331:
Nine items measuring cognitive activation in mathematics lessons (COGACT) were used in the Main Survey of PISA 2012. Table 16.29 shows the item wording and the international item parameters for this scale. Response categories were “Always or almost always”, “Often”, “Sometimes” and “Never or rarely”. [Examples are:] a) The teacher asks questions that make us reflect on the problem c) The teacher asks us to decide on our own procedures for solving complex problems d) The teacher presents problems for which there is no immediately obvious method of solution f) The teacher helps us to learn from mistakes we have made
I like this variable because, psychologically (my assumption here), it must somehow capture both the interest the student has in the subject and the support the teacher gives to his/her class.
# Index_of_economic_social_cultural_status (ESCS): how many ESCS classes do I want to create? If I want to create any..
fig, axs = plt.subplots(2,3, figsize=(20,10))
axs = axs.flatten()
bins = [5, 10, 20, 40, 80, 160]
for i in range(6):
plt.sca(axs[i])
plt.hist(data=clean_df, x='Index_of_economic_social_and_cultural_status', bins=bins[i]);
plt.title('ESCS index distribution: {} bins'.format(bins[i]))
ESCS0_below = (clean_df.Index_of_economic_social_and_cultural_status < 0).sum()
ESCS0_above = (clean_df.Index_of_economic_social_and_cultural_status >= 0).sum()
print('students below ESCS 0: {};\nstudents at or above ESCS 0: {}\ndifference: {}'.format(ESCS0_below, ESCS0_above, (ESCS0_below-ESCS0_above)))
ESCS_minus2_below = (clean_df.Index_of_economic_social_and_cultural_status < -2).sum()
ESCS_plus2_above = (clean_df.Index_of_economic_social_and_cultural_status > 2).sum()
print('\nstudents below ESCS -2: {};\nstudents at or above ESCS +2: {}\ndifference: {}'.format(ESCS_minus2_below, ESCS_plus2_above, (ESCS_minus2_below-ESCS_plus2_above)))
print('\nnumber of NaNs: ', clean_df.Index_of_economic_social_and_cultural_status.isna().sum())
ATTENTION : this variable still has NaNs!
The ESCS distribution is left-skewed. If we take 0 as a neutral point, where students are "OK" (neither privileged nor disadvantaged), then there are 20135 more students on the disadvantaged side than on the advantaged one, and a lot of them are in the lower part (ESCS < -2): 17777.
# about the NaNs: where are they?
clean_df.loc[clean_df.Index_of_economic_social_and_cultural_status.isna()]['Language_of_the_test'].value_counts()
The ESCS index is missing in 4305 rows. As seen above, most of them are rows whose "Language_of_the_test" is one of the two most-represented languages. I will delete these rows.
clean_df = clean_df.loc[~(clean_df.Index_of_economic_social_and_cultural_status.isna())]
# Attitude_towards_School:_Learning_Activities
fig, axs = plt.subplots(2,3, figsize=(20,10))
axs = axs.flatten()
bins = [5, 10, 20, 40, 80, 160]
for i in range(6):
plt.sca(axs[i])
plt.hist(data=clean_df, x='Attitude_towards_School:_Learning_Activities', bins=bins[i]);
plt.title('Attitude_towards_School:_Learning_Activities: {} bins'.format(bins[i]))
clean_df['Attitude_towards_School:_Learning_Activities'].nunique()
# Cognitive_Activation_in_Mathematics_Lessons
fig, axs = plt.subplots(2,3, figsize=(20,10))
axs = axs.flatten()
bins = [5, 10, 20, 40, 80, 160]
for i in range(6):
plt.sca(axs[i])
plt.hist(data=clean_df, x='Cognitive_Activation_in_Mathematics_Lessons', bins=bins[i]);
plt.title('Cognitive_Activation_in_Mathematics_Lessons: {} bins'.format(bins[i]))
# unique values, NaNs and their row "Language of test" values
print(clean_df['Cognitive_Activation_in_Mathematics_Lessons'].nunique())
print(clean_df.Cognitive_Activation_in_Mathematics_Lessons.isna().sum())
print(clean_df.loc[clean_df.Cognitive_Activation_in_Mathematics_Lessons.isna()]['Language_of_the_test'].value_counts())
# the language_of_the_test values:
clean_df.Language_of_the_test.value_counts()
The last two variables have a lot of NaNs, again many of them in the two larger language-type groups, but a significant number also in the logographic group. Before deciding whether to remove those rows, I'll see what happens with the other variables.
Attitude_towards_School:_Learning_Activities has a small number of unique values (76), because the scale comes from the responses to only 4 questions; it works almost like a category. Cognitive_Activation_in_Mathematics_Lessons has more (1015), depending on 9 questions, but even though the amount of data is large, 20 bins give a good representation.
One last thing: there are still nulls in the Plausible_level variables, in International_Language_at_Home and in Country_of_Birth_International_Self. If they are in the two main groups, I'll delete the rows.
# let's see where they are
clean_df.loc[clean_df.Plausible_level_1_in_math_Formulate.isna() | clean_df.Plausible_level_1_in_math_Quantity.isna()]['Language_of_the_test'].value_counts()
# delete them
clean_df = clean_df.loc[~(clean_df.Plausible_level_1_in_math_Formulate.isna() | clean_df.Plausible_level_1_in_math_Quantity.isna())]
clean_df.info()
# Country_of_Birth_International_Self and International_Language_at_Home NaNs : where
clean_df.loc[clean_df.Country_of_Birth_International_Self.isna() | clean_df.International_Language_at_Home.isna()]['Language_of_the_test'].value_counts()
# delete
clean_df = clean_df.loc[~(clean_df.Country_of_Birth_International_Self.isna() | clean_df.International_Language_at_Home.isna())]
# subset with the variables I'll keep (reordered as in the summary below)
exploration_df = clean_df[['Country', 'Student_ID', 'Gender', 'Index_of_economic_social_and_cultural_status', 'Country_of_Birth_International_Self', 'Language_at_home', 'Language_of_the_test', 'Language_type', 'International_Language_at_Home', 'Standard_or_simplified_set_of_booklets', 'Attitude_towards_School:_Learning_Activities', 'Cognitive_Activation_in_Mathematics_Lessons', 'Plausible_level_1_in_mathematics', 'Plausible_level_1_in_math_Change_and_Relationships', 'Plausible_level_1_in_math_Quantity', 'Plausible_level_1_in_math_Space_and_Shape', 'Plausible_level_1_in_math_Uncertainty_and_Data', 'Plausible_level_1_in_math_Employ', 'Plausible_level_1_in_math_Formulate', 'Plausible_level_1_in_math_Interpret', 'Plausible_level_1_in_science', 'Plausible_level_1_in_reading']].copy()
Summary for the variables I'll keep:
Index_of_economic_social_and_cultural_status : the ESCS distribution is left-skewed. If we take 0 as a neutral point, where students are "OK" (neither privileged nor disadvantaged), then there are more students on the disadvantaged side than on the advantaged one, and a lot of them are in the lower part (ESCS < -2). Values updated by re-running the cell after the last cleaning:
students below ESCS 0: 140554, students at or above ESCS 0: 126991, difference: 13563
students below ESCS -2: 16582, students above ESCS +2: 1463, difference: 15119
Plausible_level_1_in_mathematics and subscales
The distribution of the students on the math scale and subscales is unimodal and slightly right-skewed, with the majority of students in levels 2 and 3, followed by 1 and below_1, and lastly 4, 5 and 6.
The distribution of the Uncertainty_and_Data subscale is almost identical to that of the main mathematical scale. The other subscales vary to different degrees.
It makes sense that there are not many students at level "6", since, at 15, math programmes have surely not been completed. Hopefully the "Too low" scores are due to problems with the language (maybe students who moved from a different language area);
Plausible_level_1_in_science : the distribution looks more "normal", but the "Too low" column count is significant, so I would say it is still slightly right-skewed, though less so than the main math scale. As with mathematics, it makes sense that there are not many students at level "6", since, at 15, science programmes have surely not been completed. And again, hopefully the "Too low" scores are due to problems with the language;
Plausible_level_1_in_reading : this distribution looks normal. If people have access to schooling of any type, reading is surely one of the key subjects, and the skill is then applied to the study of almost all other topics.
* Attitude_towards_School:_Learning_Activities and Cognitive_Activation_in_Mathematics_Lessons have a lot of NaNs, many of them in the two larger language-type groups, but a significant number also in the logographic group. Before deciding whether to remove those rows, I'll see what happens with the other variables.
I considered using the variables about the parents' highest level of education (mother, father and both): the number of values for each column was similar, and the last one was theoretically derived from the first two, but it was not possible (the values were not comparable; details above).
It turned out the ESCS variable has been built on top of many variables, including the educational level of parents. Moreover, it has been weighted to be comparable among countries, therefore it is a much better choice.
The Plausible_levels variables I am going to use are derived from the original Plausible_values in the PISA2012 dataset, using the transformation bands described in the Technical Reports. I changed the values for interpretability: from the raw values you could only tell that a higher score corresponds to better performance in the task, while the proficiency levels offer a description of the specific abilities the student is supposed to have mastered.
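The band transformation itself is just a cut over the plausible-value scale. A sketch with `pd.cut`; the boundaries below are the commonly cited PISA 2012 mathematics thresholds, which I have not re-verified against the report, so treat them as assumptions:

```python
import pandas as pd

# assumed PISA 2012 mathematics proficiency boundaries (verify in the report)
edges = [-float('inf'), 357.77, 420.07, 482.38, 544.68, 606.99, 669.30, float('inf')]
levels = ['Below 1', '1', '2', '3', '4', '5', '6']

pv = pd.Series([302.5, 433.1, 560.0, 701.2])            # toy plausible values
plausible_level = pd.cut(pv, bins=edges, labels=levels)
print(plausible_level.tolist())  # ['Below 1', '2', '4', '6']
```

`pd.cut` with labels returns an ordered categorical, which is exactly the dtype the later plots rely on.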
Language_type is a variable I introduced to regroup the rows into one of the 3 categories above, based on the language the test was administered in.
The languages I chose to select from the dataset are the ones listed above.
Unfortunately the last group contains a lot less data than the other two. On the bright side, even that group is not that small, with 30334 rows, and, like the other two groups, it is composed of several languages and different schooling systems (see the cell below for the composition). I might decide to reduce the Spanish, Italian and English subgroups where the data come from a single country with more than 7k instances.
Lastly, I dropped a few thousand rows with missing values for the variables I am interested in. There are still two variables that present NaNs, but dropping those rows would delete up to 1/3 of the data for subgroups that are not too large (e.g. 2055 rows of 6081 for Japanese, 2025 of 5599 for Mandarin). I will wait to see some graphs before deciding whether to drop those.
# Language_type by Language_of_the_test and Country
exploration_df.groupby(['Language_type','Language_of_the_test'])['Country'].value_counts()
ATTENTION: I tried a PairGrid to have a quick look at some of the variables, and since I was just waiting for a good excuse to reduce the data, here I go.
I'll first delete the rows with NaNs, then I'll have a look at the proportion and decide how to further trim the data
exploration_df = exploration_df.loc[~(exploration_df.Cognitive_Activation_in_Mathematics_Lessons.isna() | exploration_df['Attitude_towards_School:_Learning_Activities'].isna())]
exploration_df.Language_type.value_counts()
exploration_df.groupby(['Language_type', 'Language_of_the_test'])['Country'].value_counts()
I will subsample Italian, Spanish, English, French and German (dropping a fraction from each country)
fractions_to_drop = {'Italian': {'Italy':.85},
'Spanish': {'Mexico':.95, 'Spain':.9, 'Chile':.75, 'Argentina':.65,
'Uruguay':.6, 'Peru':.65, 'Costa Rica':.6},
'English': {'Canada':.9, 'Australia':.9, 'United Kingdom':.9,
'New Zealand':.7, 'Ireland':.75, 'Singapore':.8,
'United Arab Emirates':.8, 'Qatar':.65, 'Florida (USA)':.25,
'Massachusetts (USA)':.25, 'Connecticut (USA)':.25},
'French': {'Canada':.5, 'France':.5, 'Switzerland':.4, 'Belgium':.35},
'German': {'Switzerland':.5, 'Germany':.4, 'Luxembourg':.35, 'Austria':.35},
'Arabic': {'United Arab Emirates':.4, 'Qatar':.35, 'Tunisia':.4}
}
for lang in fractions_to_drop:
for country, fraction in fractions_to_drop[lang].items():
exploration_df = exploration_df.drop(exploration_df.loc[(exploration_df.Country == country) & (exploration_df.Language_of_the_test == lang)].sample(frac=fraction).index)
exploration_df.Language_type.value_counts()
exploration_df.groupby(['Language_type'])['Language_of_the_test'].value_counts()
# a quick look at the univariate distribution of the variables I want to focus on
plt.figure(figsize=(20,4))
plt.subplot(1,4,1)
plt.hist(data=exploration_df, x='Index_of_economic_social_and_cultural_status', bins=40, color=sb.color_palette()[4]);
plt.title('ESCS index distribution: {} bins'.format('40'))
plt.subplot(1,4,2)
sb.countplot(exploration_df.Plausible_level_1_in_mathematics, color=sb.color_palette()[0])
plt.subplot(1,4,3)
sb.countplot(exploration_df.Plausible_level_1_in_science, color=sb.color_palette()[3])
plt.subplot(1,4,4)
sb.countplot(exploration_df.Plausible_level_1_in_reading, color=sb.color_palette()[2])
# continue..
plt.figure(figsize=(20,4))
plt.subplot(1,4,1)
sb.countplot(exploration_df.Gender)
plt.subplot(1,4,2)
sb.countplot(exploration_df.International_Language_at_Home)
plt.subplot(1,4,3)
sb.countplot(exploration_df.Country_of_Birth_International_Self)
plt.subplot(1,4,4)
sb.countplot(exploration_df.Standard_or_simplified_set_of_booklets)
# continue..
plt.figure(figsize=(20,4))
plt.subplot(1,2,1)
plt.hist(exploration_df['Attitude_towards_School:_Learning_Activities'], bins=10)
plt.subplot(1,2,2)
plt.hist(exploration_df.Cognitive_Activation_in_Mathematics_Lessons, bins=20);
All these distributions look similar enough to those of the clean_df. After further trimming the data, we can move on! :)
exploration_df.columns
# let's look first at a possible correlation among the math scale and subscales and the reading scale
math_vars = ['Plausible_level_1_in_mathematics',
'Plausible_level_1_in_math_Change_and_Relationships',
'Plausible_level_1_in_math_Quantity',
'Plausible_level_1_in_math_Space_and_Shape',
'Plausible_level_1_in_math_Uncertainty_and_Data',
'Plausible_level_1_in_math_Employ',
'Plausible_level_1_in_math_Formulate',
'Plausible_level_1_in_math_Interpret']
fig, axs = plt.subplots(2,4, figsize=(20,10))
axs = axs.flatten()
for i in range(8):
plt.sca(axs[i])
sb.countplot(data=exploration_df, x=math_vars[i], hue='Plausible_level_1_in_reading', palette='mako_r');
There is a clear positive correlation between test results in reading and test results in mathematics.
The math subscales are very similar to the main scale, so I'll keep only the main one.
# math level by ESCS
g = sb.FacetGrid(data=exploration_df, col='Plausible_level_1_in_mathematics', col_wrap=4)
g.map(plt.hist, 'Index_of_economic_social_and_cultural_status', bins=40);
# math level by ESCS again
sb.violinplot(data=exploration_df, x='Plausible_level_1_in_mathematics', y='Index_of_economic_social_and_cultural_status', palette='mako_r', inner='quartile');
sb.boxplot(data=exploration_df, x='Plausible_level_1_in_mathematics', y='Index_of_economic_social_and_cultural_status', color='white');
There looks to be a correlation here as well, and it is a bit sad: very low levels in mathematics can be "achieved" by students from every socioeconomic and cultural background; on the other side, students living in the most disadvantaged conditions are not present in the top mathematics scores.
# reading level by ESCS
sb.violinplot(data=exploration_df, x='Plausible_level_1_in_reading', y='Index_of_economic_social_and_cultural_status', palette='mako_r', inner='quartile');
sb.boxplot(data=exploration_df, x='Plausible_level_1_in_reading', y='Index_of_economic_social_and_cultural_status', color='white');
The relationship between reading and ESCS is similar to that between math and ESCS, maybe a little less clean in the lowest reading levels and more defined in the upper ones.
# Math levels and Gender
fig, axs = plt.subplots(2,4, figsize=(20,10))
axs = axs.flatten()
for i in range(8):
plt.sca(axs[i])
sb.countplot(data=exploration_df, x=math_vars[i], hue='Gender');
# Math level by gender (normalized)
math_gender_norm = exploration_df.groupby('Gender')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_gender_norm = math_gender_norm.mul(100)
math_gender_norm = math_gender_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
math_gender_norm.Plausible_level_1_in_mathematics = math_gender_norm.Plausible_level_1_in_mathematics.astype(ordered_math)
g = sb.catplot(data=math_gender_norm, x='Plausible_level_1_in_mathematics',y='percent',hue='Gender',
hue_order=['Male', 'Female'], kind='bar')
Girls generally seem to score a bit worse: they outnumber boys in the lower categories and are fewer in the top 3 categories.
# Reading level by gender (normalized)
read_gender_norm = exploration_df.groupby('Gender')['Plausible_level_1_in_reading'].value_counts(normalize=True)
read_gender_norm = read_gender_norm.mul(100)
read_gender_norm = read_gender_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
read_gender_norm.Plausible_level_1_in_reading = read_gender_norm.Plausible_level_1_in_reading.astype(ordered_reading)
g = sb.catplot(data=read_gender_norm, x='Plausible_level_1_in_reading',y='percent',hue='Gender',
hue_order=['Male', 'Female'], kind='point')
Contrary to math levels by gender, READING levels by gender suggest that girls do better than boys in this task.
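The groupby → value_counts(normalize=True) → mul(100) → rename → reset_index pattern above is repeated for every grouping variable; it could be factored into a small helper (the function name `pct_by_group` is my own):

```python
import pandas as pd

def pct_by_group(df, group_col, level_col):
    """Percentage of each level within each group, as a tidy frame."""
    return (df.groupby(group_col)[level_col]
              .value_counts(normalize=True)
              .mul(100)
              .rename('percent')
              .reset_index())

# toy data just to show the shape of the result
toy = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female'],
                    'Level': ['1', '2', '1', '1']})
out = pct_by_group(toy, 'Gender', 'Level')
print(out)
```

Renaming the series to 'percent' before `reset_index` avoids the name clash between the counted column and the resulting index level.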
# the one I am really interested in: Math level by language type
fig, axs = plt.subplots(2,4, figsize=(20,10))
axs = axs.flatten()
for i in range(8):
plt.sca(axs[i])
sb.countplot(data=exploration_df, x=math_vars[i], hue='Language_type', palette='icefire');
#I know it is a diverging palette, but with 3 categories it just is clearer and more pleasant than all the qualitative ones
There appears to be a difference in performance among the 3 linguistic groups, but before trying to describe it, let's normalize the counts.
# distribution of math scores by language type (normalized)
math_language_norm = exploration_df.groupby('Language_type')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_language_norm = math_language_norm.mul(100)
math_language_norm = math_language_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
math_language_norm.Plausible_level_1_in_mathematics = math_language_norm.Plausible_level_1_in_mathematics.astype(ordered_math)
g = sb.catplot(data=math_language_norm, x='Plausible_level_1_in_mathematics',y='percent',hue='Language_type',
hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='bar', palette='icefire')
After normalization (the percentages within groups are now shown) the plot changed very little.
Students using a logographic language perform better: their distribution is definitely left-skewed. Both the deep and the shallow orthography groups show a distribution of math scores skewed to the right, with most of the students in the 4 lowest categories.
# math scores vs international language at home (normalized)
math_int_lang_norm = exploration_df.groupby('International_Language_at_Home')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_int_lang_norm = math_int_lang_norm.mul(100)
math_int_lang_norm = math_int_lang_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
math_int_lang_norm.Plausible_level_1_in_mathematics = math_int_lang_norm.Plausible_level_1_in_mathematics.astype(ordered_math)
sb.catplot(data=math_int_lang_norm, x='Plausible_level_1_in_mathematics', y='percent', hue='International_Language_at_Home',
kind='bar')
# reading score vs international language at home (normalized)
# just percentages, connected dot, so that it is easier to compare them
read_int_lang_norm = exploration_df.groupby('International_Language_at_Home')['Plausible_level_1_in_reading'].value_counts(normalize=True)
read_int_lang_norm = read_int_lang_norm.mul(100)
read_int_lang_norm = read_int_lang_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
read_int_lang_norm.Plausible_level_1_in_reading = read_int_lang_norm.Plausible_level_1_in_reading.astype(ordered_reading)
sb.catplot(data=read_int_lang_norm, x='Plausible_level_1_in_reading', y='percent', hue='International_Language_at_Home',
kind='point');
It looks like there is a disadvantage for students whose international language at home differs from the language they took the test in. I'll drop the "other language" rows and look again at the relation between math scores and language type.
# see how many rows I'm talking about
exploration_df.groupby('Language_type').International_Language_at_Home.value_counts()
#subset where international language at home is the same of the test
home_test_lang_df = exploration_df[exploration_df.International_Language_at_Home=='Language of the test']
# distribution of math scores by language type (normalized) - language at home consistent with language of the test
# bars and connected point
math_language_norm2 = home_test_lang_df.groupby('Language_type')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_language_norm2 = math_language_norm2.mul(100)
math_language_norm2 = math_language_norm2.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
math_language_norm2.Plausible_level_1_in_mathematics = math_language_norm2.Plausible_level_1_in_mathematics.astype(ordered_math)
sb.catplot(data=math_language_norm2, x='Plausible_level_1_in_mathematics',y='percent',hue='Language_type',
hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='bar', palette='icefire')
sb.catplot(data=math_language_norm2, x='Plausible_level_1_in_mathematics',y='percent',hue='Language_type',
hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='point', palette='icefire')
# let's see the difference in percentages
difference_without_intern_lang_rows = math_language_norm.copy()
difference_without_intern_lang_rows.percent = difference_without_intern_lang_rows.percent - math_language_norm2.percent
sb.catplot(data=difference_without_intern_lang_rows, x='Plausible_level_1_in_mathematics',y='percent',hue='Language_type',
hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='point', linestyles='', palette='icefire')
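The subtraction above relies on `math_language_norm` and `math_language_norm2` having their rows in exactly the same order. A merge on the grouping keys makes the alignment explicit; a sketch with hypothetical toy frames standing in for the two percent tables:

```python
import pandas as pd

# Two percent tables whose rows are NOT in the same order.
before = pd.DataFrame({'lang': ['deep', 'deep', 'shallow'],
                       'level': [1, 2, 1],
                       'percent': [60.0, 40.0, 55.0]})
after = pd.DataFrame({'lang': ['deep', 'shallow', 'deep'],
                      'level': [1, 1, 2],
                      'percent': [59.8, 55.3, 40.2]})

# Merging on the keys aligns rows correctly regardless of order.
diff = before.merge(after, on=['lang', 'level'], suffixes=('_before', '_after'))
diff['delta'] = diff['percent_before'] - diff['percent_after']
print(diff[['lang', 'level', 'delta']])
```

Positional subtraction would have compared the "deep, level 2" row with the "shallow, level 1" row here; the merge cannot make that mistake.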
The distribution doesn't change much without the rows where the language at home differs from the language of the test: the percentages shift only within a range of about -0.3 to +0.35 percentage points.
So, again, the students using a logographic language perform better: their distribution is clearly left-skewed. Both the deep and the shallow orthography groups show a distribution of math scores skewed to the right, with most of the students in the 4 lowest categories. The lowest category, "Below 1", is the one where the two alphabetic groups differ most, and the deep orthography group seems to do worse of the two.
There may be many different reasons for this.
There is one area, Macao-China, for which the PISA2012 dataset records a large number of tests administered in Chinese, and a smaller number in English.
print(clean_df[clean_df.Country=='Macao-China'].Language_of_the_test.value_counts())
print('\n')
print(clean_df[clean_df.Country=='Macao-China'].International_Language_at_Home.value_counts())
within_macao = clean_df[clean_df.Country=='Macao-China'].copy()
within_macao2 = within_macao[within_macao.International_Language_at_Home=='Language of the test']
# Chinese 4253
# English 51
within_macao2.groupby('Language_of_the_test').Language_at_home.value_counts()
# math score by language at home
macao2 = within_macao2.groupby('Language_at_home')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
macao2 = macao2.mul(100)
macao2 = macao2.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
macao2.Plausible_level_1_in_mathematics = macao2.Plausible_level_1_in_mathematics.astype(ordered_math)
sb.catplot(data=macao2, x='Plausible_level_1_in_mathematics', y='percent', hue='Language_at_home', hue_order=['English', 'Cantonese', 'Mandarin', 'Chinese dialects or languages (MAC)'],
kind='point', palette='bright');
The number of tests in English is not large (51); however, the Mandarin-speaking students here are not many more (70), and the distribution of their scores follows that of the much larger Cantonese sample (4056 students).
If anything, since the distribution of math scores for the English-speaking students differs from that of the larger group, we can think that language is not a barrier per se.
NOTE: unfortunately the school system in Macao-China "does not have a single centralised set of standards or curriculum. Individual schools follow different educational models, including Chinese, Portuguese, Hong Kong, and British systems." Nonetheless, "the majority of the schools in Macau are grammar schools, which offer language learning, mathematics, science subjects, social studies, etc. to the pupils", so it is reasonable to think that these data don't come from vocational school students (vocational education there starts only after age 15 anyway).
sb.regplot(data=home_test_lang_df, x='Attitude_towards_School:_Learning_Activities',
y='Cognitive_Activation_in_Mathematics_Lessons', scatter_kws={'alpha':.1})
sb.regplot(data=home_test_lang_df, x='Attitude_towards_School:_Learning_Activities',
y='Index_of_economic_social_and_cultural_status', scatter_kws={'alpha':.1})
sb.regplot(data=home_test_lang_df, x='Cognitive_Activation_in_Mathematics_Lessons',
y='Index_of_economic_social_and_cultural_status', scatter_kws={'alpha':.01})
# distribution of math scores by type of booklet (normalized) - language at home consistent with language of the test
# bars and connected point
math_booklet_norm = home_test_lang_df.groupby('Standard_or_simplified_set_of_booklets')['Plausible_level_1_in_mathematics'].value_counts(normalize=True)
math_booklet_norm = math_booklet_norm.mul(100)
math_booklet_norm = math_booklet_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
math_booklet_norm.Plausible_level_1_in_mathematics = math_booklet_norm.Plausible_level_1_in_mathematics.astype(ordered_math)
sb.catplot(data=math_booklet_norm, x='Plausible_level_1_in_mathematics',y='percent',hue='Standard_or_simplified_set_of_booklets',
kind='bar')
sb.catplot(data=math_booklet_norm, x='Plausible_level_1_in_mathematics',y='percent',hue='Standard_or_simplified_set_of_booklets',
kind='point')
ATTENTION!!! The distribution of math scores of the students who solved the easier set of booklets looks a lot like the distribution of the scores of the two alphabetic language groups; the distribution of math scores of the students who solved the Standard set of booklets looks a lot like the distribution of the scores of the logographic language group.
# distribution of type by language type (normalized) - language at home consistent with language of the test
# bars and connected point
langtype_booklet_norm = home_test_lang_df.groupby('Standard_or_simplified_set_of_booklets')['Language_type'].value_counts(normalize=True)
langtype_booklet_norm = langtype_booklet_norm.mul(100)
langtype_booklet_norm = langtype_booklet_norm.rename('percent').reset_index()
sb.catplot(data=langtype_booklet_norm, x='Language_type',y='percent',hue='Standard_or_simplified_set_of_booklets',
kind='bar', palette='muted');
# percentage of booklet_types in each language group
#cat1
lang_type_order = ['shallow orthography', 'deep orthography', 'logographic']
#cat2
book_type_order = home_test_lang_df.Standard_or_simplified_set_of_booklets.unique().tolist()
# turn the variables into categories to avoid problems for zeros in the graph
booklet_category = pd.api.types.CategoricalDtype(ordered=False, categories=book_type_order)
home_test_lang_df.Standard_or_simplified_set_of_booklets = home_test_lang_df.Standard_or_simplified_set_of_booklets.astype(booklet_category)
langtype_category = pd.api.types.CategoricalDtype(ordered=True, categories=lang_type_order)
home_test_lang_df.Language_type = home_test_lang_df.Language_type.astype(langtype_category)
artists = [] # store reference to plot elements
baselines = np.zeros(len(lang_type_order))
lang_type_counts = home_test_lang_df.Language_type.value_counts()
# for each cat2 value
for i in range(len(book_type_order)):
    book_type = book_type_order[i]
    # get proportions of the cat1 values
    inner_counts = home_test_lang_df[home_test_lang_df.Standard_or_simplified_set_of_booklets==book_type]['Language_type'].value_counts()
    inner_props = inner_counts/lang_type_counts
    # and plot them on top of previous ones
    bars = plt.bar(x=np.arange(len(lang_type_order)), height=inner_props[lang_type_order], bottom=baselines)
    artists.append(bars)
    baselines += inner_props[lang_type_order]
plt.xticks(np.arange(len(lang_type_order)), lang_type_order)
plt.legend(reversed(artists), reversed(book_type_order), framealpha=1, loc=6)
The math scores are widely different between students who completed the standard set of booklets and those who completed the simplified one, and the simplified set appears only in the two alphabetic language groups.
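The manual stacked-bar loop above can also be condensed with a row-normalized crosstab; a sketch on toy stand-ins for `Language_type` and the booklet column:

```python
import pandas as pd

# Toy stand-ins for Language_type and Standard_or_simplified_set_of_booklets.
toy = pd.DataFrame({
    'lang_type': ['shallow', 'shallow', 'deep', 'deep', 'logographic'],
    'booklet':   ['Standard', 'Simplified', 'Standard', 'Standard', 'Standard'],
})

# normalize='index' makes each language-type row sum to 1.
props = pd.crosstab(toy['lang_type'], toy['booklet'], normalize='index')
print(props)
```

`props.plot(kind='bar', stacked=True)` would then draw the same stacked bars without managing the baselines by hand.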
#home_test_lang_df[home_test_lang_df.Standard_or_simplified_set_of_booklets=='Standard set of booklets'].Language_type.value_counts()
Math levels vs reading levels: there is a clear positive correlation between test results in reading and test results in mathematics.
Math levels by ESCS: there appears to be a correlation here as well, and it is a bit sad: very bad levels in mathematics can be "achieved" by students from every socioeconomic and cultural background; on the other side, students living in the most disadvantaged conditions are not present in the top mathematics scores.
Math levels by gender (normalized): girls seem to score a bit worse, generally: they outnumber boys in the lower categories, and are fewer in the 3 top categories.
Math scores vs international language at home (normalized) and reading scores vs international language at home (normalized): it looks like there is a disadvantage for students whose international language at home is different from the language they took the test in. So, I dropped the "other language" rows and had a second look at the relation between math scores and language type.
Distribution of math scores by language type (normalized) - language at home consistent with language of the test: the students using a logographic language perform better; their distribution is clearly left-skewed.
Both the deep and the shallow orthography groups show a distribution of math scores skewed to the right, with most of the students in the 4 lowest categories. The lowest category, "Below 1", is the one where the two alphabetic groups differ most, and the deep orthography group seems to do worse of the two.
There may be many different reasons for this.
Since I have an area, Macao-China, for which the PISA2012 dataset records a large number of tests administered in Chinese and a smaller number in English, I had a look at the distribution of the math scores there: the number of tests in English is not large (51); however, the Mandarin-speaking students are not many more (70), and the distribution of their scores follows that of the much larger Cantonese sample (4056 students).
If anything, since the distribution of math scores for the English-speaking students differs from that of the larger group, we can think that language is not a barrier per se.
Contrary to math levels by gender, READING levels by gender suggest that girls are better than boys at this task.
Reading and ESCS: the relationship between reading and ESCS is similar to that between math and ESCS, maybe a little less clean in the lowest reading levels and more defined in the upper ones.
ATTENTION!!! Type of booklets: the distribution of math scores of the students who solved the easier set of booklets looks a lot like the distribution of the scores of the two alphabetic language groups.
The distribution of math scores of the students who solved the Standard set of booklets looks a lot like the distribution of the scores of the logographic language group.
The simplified set is present only in the two alphabetic language groups.
I'll first have a look at the distributions divided by booklet type.
# distribution of math scores by language type, divided by booklet type - language at home consistent with language of the test
g = sb.FacetGrid(home_test_lang_df, col='Standard_or_simplified_set_of_booklets', height=4, aspect=1.5, sharex=False, sharey=False)
g.map_dataframe(sb.countplot, 'Plausible_level_1_in_mathematics', hue='Language_type', palette=sb.color_palette()).add_legend()
!!! Look at the distribution shapes: bar heights are not comparable among groups now (sample sizes differ a lot) !!!
Taking away the simplified-booklet results, the distribution of the math scores of the two alphabetic groups improves a lot. The shallow orthography one looks normal. The deep orthography one is still a bit heavier on the low-result side. The logographic one obviously has not changed.
# distribution of MATH SCORES by LANGUAGE TYPE, divided by BOOKLET TYPE and by COUNTRY
# language at home consistent with language of the test
g = sb.FacetGrid(home_test_lang_df, col='Standard_or_simplified_set_of_booklets', row='Country', height=4, aspect=1.5, sharex=False, sharey=False)
g.map_dataframe(sb.countplot, 'Plausible_level_1_in_mathematics', hue='Language_type', palette=sb.color_palette()).add_legend();
The countries that chose to use the simplified version used ONLY that one. In those countries the results are all right-skewed, with very bad results in general. Among the countries that used the Standard set of booklets, a similar distribution is seen only in Qatar.
The "standard booklet" countries' distributions look generally more symmetrical, with the mode varying between
Two exceptions are:
A real limitation is that, by excluding the rows where the international language at home differs from the language of the test, together with the simplified set of booklets, some interesting countries (such as Singapore or Switzerland) are left with little data.
# have a look at the initially selected Countries data
pisa_subset.head()
pisa_subset.shape
Variables (from pisa_variables) that I'm keeping after the exploration done so far:
descriptions = ['Country code 3-character',
'Student ID',
'Gender',
'Language of the test',
'Standard or simplified set of booklets',
'Attitude towards School: Learning Activities',
'Cognitive Activation in Mathematics Lessons',
'Index of economic, social and cultural status',
'Language at home (3-digit code)',
'International Language at Home',
'Plausible value 1 in mathematics',
'Plausible value 1 in reading']
#get the column names
columns_to_keep = pisa_variables[pisa_variables['x'].isin(descriptions)]['code'].tolist()
pisa_chosen_var = pisa_subset[columns_to_keep]
pisa_chosen_var
pisa_chosen_var.shape
pisa_chosen_var.info()
This is not data that I can retrieve somewhere else; therefore, after a bit of cleaning, I will assess the columns I want to use and decide what to do.
The dataset looks OK.
# make a copy
df1 = pisa_chosen_var.copy() # again, but df is a convenient name
print('Done!')
The column names (from Excel) are:
mathematics: 'PV1MATH'
reading: 'PV1READ'
# convert the columns from strings to float
all_PV2 = ['PV1MATH', 'PV1READ']
for col in all_PV2:
    df1[col] = df1[col].astype('float')
print('Done!')
# verify dtype
df1.info()
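`astype('float')` raises on the first malformed entry it meets. If any stray non-numeric strings were present, `pd.to_numeric` with `errors='coerce'` would be a more forgiving alternative (the `'N/A'` value below is a hypothetical example, not something observed in the PISA columns):

```python
import pandas as pd

# astype('float') would raise on 'N/A'; to_numeric converts what it can
# and leaves NaN for the rest.
s = pd.Series(['495.3', '612.0', 'N/A'])
clean = pd.to_numeric(s, errors='coerce')
print(clean)
```

The resulting NaNs can then be inspected or dropped explicitly, instead of the conversion failing wholesale.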
# convert the values for mathematics and turn into categories INTO A NEW COLUMN
print('Working on this..')
# apply conversion
df1['Plausible_level_math'] = df1['PV1MATH'].apply(conversion_math)
df1['Plausible_level_math'] = df1['Plausible_level_math'].astype(ordered_math)
print('Done!')
# convert the values for reading and turn into categories INTO A NEW COLUMN
print('Working on this..')
# apply
df1['Plausible_level_reading'] = df1['PV1READ'].apply(conversion_reading)
df1['Plausible_level_reading'] = df1['Plausible_level_reading'].astype(ordered_reading)
print('Done!')
df1.head()
df1.info()
# all my chosen column descriptions are in the var description, the name are in the var columns_to_keep
# unfortunately I cannot zip them, because the order is not the same
# create a subset of pisa_variables
my_columns = pisa_variables[pisa_variables['code'].isin(columns_to_keep)].copy()
# clean the descriptions a little bit
# sorry for the superlong lines, but .replace() was not working and I decided to look for the reason at a later moment
my_columns['x'] = my_columns.x.str.replace(' - ', '_').str.replace(' ','_').str.replace("'s",'').str.replace('-','_').str.replace('<','_').str.replace('>','').str.replace(',','').str.replace('__','_')
my_columns['x'] = my_columns.x.str.replace('_value_1_', '_value_').str.replace('content_subscale_of_','').str.replace('process_subscale_of_','').str.replace('_code_3_character', '').str.replace('_.3_digit_code.', '')
# create a dictionary of the current column names and the cleaned descriptions, set the latter as new names
columns_dict = dict(zip(my_columns['code'], my_columns['x']))
df1.rename(columns=columns_dict, inplace=True)
# check up
df1.head()
df1.International_Language_at_Home.unique()
df1.Standard_or_simplified_set_of_booklets.unique()
# drop all the rows where booklets are the simplified ones
df1 = df1.query('Standard_or_simplified_set_of_booklets=="Standard set of booklets"')
# drop the column
df1 = df1.drop('Standard_or_simplified_set_of_booklets', axis=1)
df1.shape
# Attitude_towards_School:_Learning_Activities converted into float
df1['Attitude_towards_School:_Learning_Activities'] = df1['Attitude_towards_School:_Learning_Activities'].astype(float)
# Cognitive activation and ESCS index to float
df1['Cognitive_Activation_in_Mathematics_Lessons'] = df1['Cognitive_Activation_in_Mathematics_Lessons'].astype(float)
df1['Index_of_economic_social_and_cultural_status'] = df1['Index_of_economic_social_and_cultural_status'].astype(float)
df1['Language_at_home'].unique()
# clean the trailing whitespaces
df1.Language_at_home = df1.Language_at_home.str.strip()
# Language_of_test: check the values
df1.Language_of_the_test.unique()
Language of the test is OK as a string, but I can drop the rows where the test is not in one of my chosen languages. The groups are:
shallow_ortography: Spanish, Finnish, Italian, German
deep_ortography: English, French, Arabic
logographic: Chinese, Japanese, Korean. To these I will add Shanghai dialect, Mandarin and Cantonese, because they are all written with Chinese characters.
Since English and Arabic belong to the same group, I will keep the "Hybrid - English + Arabic (QAT)" group, renaming it as 'English_Arabic'.
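The same grouping can also be expressed as one flat dict plus `Series.map`, which scales better if the language lists grow; a sketch (the lists mirror the ones used in this notebook):

```python
import pandas as pd

shallow = ['Spanish', 'Finnish', 'Italian', 'German']
deep = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect',
               'Mandarin', 'Cantonese']

# One flat language -> group mapping, built from the three lists.
lang_to_type = {**{l: 'shallow orthography' for l in shallow},
                **{l: 'deep orthography' for l in deep},
                **{l: 'logographic' for l in logographic}}

langs = pd.Series(['Italian', 'English_Arabic', 'Cantonese'])
mapped = langs.map(lang_to_type)
print(mapped)
```

Languages missing from the dict map to NaN, which doubles as a check that no test language was left unclassified.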
shallow_ortography =['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']
test_lang_to_keep = shallow_ortography + deep_ortography + logographic
# clean the test language labels
df1.Language_of_the_test = df1.Language_of_the_test.str.strip().str.replace(r'Hybrid.*', 'English_Arabic', regex=True)
# check values
df1.Language_of_the_test.unique()
# drop all rows where the test is not in test_lang_to_keep
df1 = df1.query('Language_of_the_test in @test_lang_to_keep')
df1.info()
I still need to drop the nulls from
# drop rows where the language at home is not the language of the test
df1 = df1.query('International_Language_at_Home=="Language of the test"')
# and drop the column
df1.drop('International_Language_at_Home', axis=1, inplace=True)
# drop rows without ESCS index
df1 = df1.dropna(subset=['Index_of_economic_social_and_cultural_status'])
df1.info()
df1[['Plausible_value_in_mathematics', 'Plausible_value_in_reading']].describe()
# math and reading plausible values
plt.figure(figsize=(10,6))
plt.subplot(1,2,1)
plt.hist(df1.Plausible_value_in_mathematics, color=sb.color_palette()[2], bins=120);
plt.subplot(1,2,2)
plt.hist(df1.Plausible_value_in_reading, color=sb.color_palette()[3], bins=120);
# math and reading plausible levels
plt.figure(figsize=(10,6))
plt.subplot(1,2,1)
sb.countplot(df1.Plausible_level_math, color=sb.color_palette()[2]);
plt.subplot(1,2,2)
sb.countplot(df1.Plausible_level_reading, color=sb.color_palette()[3]);
MATHEMATICS
With the dataset better cleaned from the start, the math values and levels appear to be normally distributed, with the mode at level 3 and a mean around 500 for the values.
READING
The distribution of the reading levels has a mode of 3 as well, and the mean of the reading values is around 500, like that of the math values, but these two distributions are slightly left-skewed.
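The skewness claims can be checked numerically: pandas' `.skew()` (sample skewness) is negative for a left-skewed distribution. A sketch on toy scores with a long left tail:

```python
import pandas as pd

# Toy scores: one long left tail (200) pulls the sample skewness negative.
scores = pd.Series([200, 350, 420, 460, 480, 500, 510, 520, 530, 540])
print(scores.skew())
```

Applied to `df1.Plausible_value_in_reading`, this would put a number on the "slightly left-skewed" observation instead of eyeballing the histogram.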
CATEGORICAL variables
# CATEGORIES:
categories_to_plot = ['Country', 'Language_of_the_test', 'Language_at_home', 'Gender']
# create the grid
fig, axes = plt.subplots(2,2, figsize=(20,20))
axes = axes.flatten()
for i in range(4):
    plt.sca(axes[i])
    col = categories_to_plot[i] # column to plot
    sb.countplot(data=df1, y=col, color=sb.color_palette()[i])
    plt.xticks(rotation=20)
Language_of_the_test: English is the most represented, followed by Italian. I will need to regroup them again into my 3 language-type categories; the logographic group is still the least numerous.
Language_at_home: a reasonable number of languages survived the last cleaning.
Let's regroup the languages into my main three categories, which are already summarized in these variables:
shallow_ortography =['Spanish', 'Finnish', 'Italian', 'German']
deep_ortography = ['English', 'French', 'Arabic', 'English_Arabic']
logographic = ['Chinese', 'Japanese', 'Korean', 'Shanghai dialect', 'Mandarin', 'Cantonese']
# create a column "Language_type" and populate it on the base of the language_of_the_test variable
df1.loc[df1.Language_of_the_test.isin(shallow_ortography), 'Language_type'] = 'shallow orthography'
df1.loc[df1.Language_of_the_test.isin(deep_ortography), 'Language_type'] = 'deep orthography'
df1.loc[df1.Language_of_the_test.isin(logographic), 'Language_type'] = 'logographic'
# check my new column
df1.Language_type.value_counts()
# plot it
labels = []
for a, b in zip(df1.Language_type.value_counts().index, df1.Language_type.value_counts().values/df1.Language_type.count()*100):
    labels.append(a + '\n' + '{:.2f}'.format(b) + '%')
plt.pie(df1.Language_type.value_counts(), labels=labels, startangle=-74, counterclock=True);
plt.axis('square');
Logographic languages still account for a much smaller proportion of the overall data than the other two language groups.
# Index_of_economic_social_cultural_status (ESCS):
fig, axs = plt.subplots(1,3, figsize=(20,5))
axs = axs.flatten()
bins = [20, 40, 80]
for i in range(len(bins)):
    plt.sca(axs[i])
    plt.hist(data=df1, x='Index_of_economic_social_and_cultural_status', bins=bins[i]);
    plt.title('ESCS index distribution: {} bins'.format(bins[i]))
ESCS0_below = (df1.Index_of_economic_social_and_cultural_status < 0).sum()
ESCS0_above = (df1.Index_of_economic_social_and_cultural_status >= 0).sum()
print('students below ESCS 0: {};\nstudents at or above ESCS 0: {}\ndifference: {}'.format(ESCS0_below, ESCS0_above, (ESCS0_below-ESCS0_above)))
ESCS_minus2_below = (df1.Index_of_economic_social_and_cultural_status < -2).sum()
ESCS_plus2_above = (df1.Index_of_economic_social_and_cultural_status > 2).sum()
print('\nstudents below ESCS -2: {};\nstudents at or above ESCS +2: {}\ndifference: {}'.format(ESCS_minus2_below, ESCS_plus2_above, (ESCS_minus2_below-ESCS_plus2_above)))
print('\nnumber of NaNs: ', df1.Index_of_economic_social_and_cultural_status.isna().sum())
The ESCS distribution is left-skewed. If we take 0 as a neutral point, where students are "OK" (not privileged, but not disadvantaged either), then there are 17204 more students on the disadvantaged side than on the advantaged one; before the last cleaning there were 20135. As for the extremes, there are now 962 more students with ESCS < -2 than with ESCS > +2; before, there were 17777 of them.
The distribution is slightly less skewed; two possible reasons:
# Attitude_towards_School:_Learning_Activities
fig, axs = plt.subplots(1,3, figsize=(20,5))
axs = axs.flatten()
bins = [10, 20, 40]
for i in range(len(bins)):
    plt.sca(axs[i])
    plt.hist(data=df1, x='Attitude_towards_School:_Learning_Activities', bins=bins[i]);
    plt.title('Attitude_towards_School:_Learning_Activities: {} bins'.format(bins[i]))
# Cognitive_Activation_in_Mathematics_Lessons
fig, axs = plt.subplots(1,3, figsize=(20,5))
axs = axs.flatten()
bins = [10, 20, 40]
for i in range(len(bins)):
    plt.sca(axs[i])
    plt.hist(data=df1, x='Cognitive_Activation_in_Mathematics_Lessons', bins=bins[i]);
    plt.title('Cognitive_Activation_in_Mathematics_Lessons: {} bins'.format(bins[i]))
# unique values, NaNs and their row "Language of test" values
print(df1.Cognitive_Activation_in_Mathematics_Lessons.isna().sum())
print(df1.loc[df1.Cognitive_Activation_in_Mathematics_Lessons.isna()]['Language_of_the_test'].value_counts())
# the language_of_the_test values:
df1.Language_of_the_test.value_counts()
The last two variables have a lot of NaNs, again many of them in the two bigger language-type groups, but a significant number also in the logographic group. Since I'm not using these two variables, they will remain as they are.
df1.info()
# make a copy (reordered, without Cognitive activation and Attitude towards school)
exploration_df1 = df1[['Country', 'Student_ID', 'Gender', 'Index_of_economic_social_and_cultural_status', 'Language_at_home', 'Language_of_the_test', 'Language_type', 'Plausible_value_in_mathematics', 'Plausible_value_in_reading', 'Plausible_level_math', 'Plausible_level_reading']].copy()
Data were difficult to interpret because of the presence of the countries that chose to use only the simplified set of booklets. I had initially kept them, because they had been weighted so that the results of those countries would be comparable with the rest of the data. It is indeed possible to use all the data together (with the correct weight for each row) to judge the performance of each country's schooling system compared to the others. However, I am regrouping the data by a different criterion (the language type), and therefore including the countries whose schooling system evidently has a markedly different effect on the performance of their students (in fact, those countries chose the simplified booklets) was simply adding complexity (and a variable that needed to be highlighted).
Along the same lines, I directly dropped the rows where the language at home was not the language of the test.
Along with the Plausible_levels variables, this time I kept the original Plausible_values given in the PISA2012 dataset. It is true that the levels are easier to interpret, but the values are a numeric variable that can be useful for the visualizations.
This time I did not trim the data to try to get samples of the same size, because I found that it can be useful to explore within countries as well, to get an idea of the effect of the schooling system.
# Language_type by Language_of_the_test and Country
exploration_df1.groupby(['Language_type','Language_of_the_test'])['Country'].value_counts()
exploration_df1.Language_type.value_counts()
exploration_df1.groupby(['Language_type'])['Language_of_the_test'].value_counts()
# a quick look at the univariate distribution of the variables, side to side, I want to focus on
plt.figure(figsize=(20,4))
plt.subplot(1,3,1)
plt.hist(data=exploration_df1, x='Plausible_value_in_mathematics', bins=120, color=sb.color_palette()[2]);
plt.subplot(1,3,2)
plt.hist(data=exploration_df1, x='Plausible_value_in_reading', bins=120, color=sb.color_palette()[0]);
plt.subplot(1,3,3)
plt.hist(data=exploration_df1, x='Index_of_economic_social_and_cultural_status', bins=120, color=sb.color_palette()[4]);
plt.title('ESCS index distribution: {} bins'.format('40'))
exploration_df1[['Plausible_value_in_mathematics', 'Plausible_value_in_reading', 'Index_of_economic_social_and_cultural_status']].describe()
The plausible values in reading and the ESCS index distributions differ in shape (and of course in scale and spread), but they look very similar in their left-skewness.
After better cleaning the data, we can move on! :)
exploration_df1.columns
# let's look first at a possible correlation between the math scale and subscales and the reading scale
sb.countplot(data=exploration_df1, x='Plausible_level_math', hue='Plausible_level_reading', palette='mako_r');
sb.regplot(data=exploration_df1, x='Plausible_value_in_mathematics',
y='Plausible_value_in_reading', scatter_kws={'alpha':.01});
There is a clear positive correlation between test result in reading and test results in mathematics.
The scatterplot could be improved by sampling the data.
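One way to do that sampling: a fixed-seed random subsample before calling `sb.regplot`, so the plot is reproducible and less overplotted. The sample size (2000) and the toy column are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for exploration_df1; in the real notebook one would
# sample exploration_df1 itself.
toy = pd.DataFrame({'math': np.random.default_rng(0).normal(500, 90, 10000)})

# Fixed-seed subsample: reproducible, and small enough to avoid overplotting.
sample = toy.sample(n=2000, random_state=42)
print(len(sample))
```

`sb.regplot(data=sample, x=..., y=..., scatter_kws={'alpha':.1})` would then draw only these rows, keeping the fitted line readable.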
# cut the ESCS index into bands and create a new category column
min_ESCS = exploration_df1.Index_of_economic_social_and_cultural_status.min()
max_ESCS = exploration_df1.Index_of_economic_social_and_cultural_status.max()
band_limits = np.linspace(min_ESCS, max_ESCS, num=10, endpoint=True)
# create the category
ESCS_bands = ['-5.32 to -4.38', '-4.38 to -3.44', '-3.44 to -2.51',
'-2.51 to -1.57', '-1.57 to -0.63', '-0.63 to 0.31',
'0.31 to 1.24', '1.24 to 2.18', '2.18 to 3.12']
ESCS_bands_order = pd.api.types.CategoricalDtype(ordered=True, categories=ESCS_bands)
def numerical_to_category(x, band_limits=[0,1,2,3,4,5]):
    for i in range(len(band_limits)-1):
        if band_limits[i] <= x < band_limits[i+1]:
            return '{:.2f} to {:.2f}'.format(band_limits[i], band_limits[i+1])
        elif x == band_limits[-1]:
            return '{:.2f} to {:.2f}'.format(band_limits[-2], band_limits[-1])
# apply to the ESCS index column and create a new column
exploration_df1['ESCS_levels'] = exploration_df1.Index_of_economic_social_and_cultural_status.apply(numerical_to_category, band_limits=band_limits)
# turn into ordered category
exploration_df1.ESCS_levels = exploration_df1.ESCS_levels.astype(ESCS_bands_order)
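`pd.cut` does this banding natively: equal-width bins over the observed range, returned directly as an ordered categorical. A sketch on toy ESCS values, using the same linspace-edges idea as the cells above:

```python
import numpy as np
import pandas as pd

# Toy ESCS values; the real notebook uses 10 linspace edges over the column.
escs = pd.Series([-5.0, -1.2, 0.0, 2.9])
edges = np.linspace(escs.min(), escs.max(), num=10, endpoint=True)

# include_lowest=True keeps the minimum value inside the first band.
bands = pd.cut(escs, bins=edges, include_lowest=True)
print(bands)
```

This replaces the hand-written `numerical_to_category` helper and the separate `CategoricalDtype` step, since `pd.cut` already returns ordered categories.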
# ESCS index by math level
g = sb.FacetGrid(data=exploration_df1, col='Plausible_level_math', col_wrap=4)
g.map(plt.hist, 'Index_of_economic_social_and_cultural_status', bins=40);
# math level by ESCS again
sb.violinplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_mathematics', color=sb.color_palette()[4], inner='quartile');
sb.boxplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_mathematics', color='white');
If we plot the math levels by the ESCS index, we see the curve starting slightly right-skewed for math level "Below 1", becoming normal and then bending to the other side, with an ESCS index mean that increases as we move up the math levels. By level 6 it is clearly left-skewed.
There appears to be a correlation here, and it is a bit sad: very bad levels in mathematics can be "achieved" by students from every socioeconomic and cultural background (though, actually, students in the lowest ESCS level are not the worst achievers). On the other side, students living in the most disadvantaged conditions are not present in the top mathematics scores.
# ESCS index by reading levels
g = sb.FacetGrid(data=exploration_df1, col='Plausible_level_reading', col_wrap=4)
g.map(plt.hist, 'Index_of_economic_social_and_cultural_status', bins=40);
# reading level by ESCS
sb.violinplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_reading', palette='mako_r', inner='quartile');
sb.boxplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_reading', color='white');
The relationship between reading and ESCS is similar to that between math and ESCS.
# Math value by gender (normalized)
g = sb.FacetGrid(data=exploration_df1, hue='Gender', hue_order=['Male', 'Female'], height=6, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True);
g.add_legend();
exploration_df1.groupby('Gender').Plausible_value_in_mathematics.describe()
Girls seem to score a bit worse, generally: their mean is slightly lower, and their std is narrower.
# reading values by gender
g = sb.FacetGrid(data=exploration_df1, hue='Gender', hue_order=['Male', 'Female'], height=6, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_reading', norm_hist=True);
g.add_legend();
# Reading level by gender (normalized)
read_gender_norm = exploration_df1.groupby('Gender')['Plausible_level_reading'].value_counts(normalize=True)
read_gender_norm = read_gender_norm.mul(100)
read_gender_norm = read_gender_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
read_gender_norm.Plausible_level_reading = read_gender_norm.Plausible_level_reading.astype(ordered_reading)
g = sb.catplot(data=read_gender_norm, x='Plausible_level_reading',y='percent',hue='Gender',
hue_order=['Male', 'Female'], kind='point')
Contrary to math levels by gender, READING levels by gender suggest that girls are better than boys at this task.
# the one I am really interested in: Math level by language type
sb.countplot(data=exploration_df1, x='Plausible_level_math', hue='Language_type', palette='icefire');
plt.legend(loc=1)
# I know it is a diverging palette, but with 3 categories it is just clearer and more pleasant than all the qualitative ones
There appears to be a difference in performance among the three linguistic groups, but before trying to describe it, let's normalize the counts.
# distribution of math scores by language type (normalized)
math_language_norm = exploration_df1.groupby('Language_type')['Plausible_level_math'].value_counts(normalize=True)
math_language_norm = math_language_norm.mul(100)
math_language_norm = math_language_norm.rename('percent').reset_index()
# turn Plausible_level into an ordered category again
math_language_norm.Plausible_level_math = math_language_norm.Plausible_level_math.astype(ordered_math)
g = sb.catplot(data=math_language_norm, x='Plausible_level_math',y='percent',hue='Language_type',
hue_order=['deep orthography', 'shallow orthography', 'logographic'], kind='bar', palette='icefire')
# math values by language type
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=6, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_mathematics', kde=False, bins=120);
g.add_legend();
# math values by language type (normalized)
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=6, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True, bins=120);
g.add_legend();
exploration_df1.groupby('Language_type').Plausible_value_in_mathematics.describe()
After normalization (the plot now shows percentages within groups) very little changed.
Students using a logographic language perform better. Looking at performance levels, their distribution seems left-skewed. Looking at the values, which are more fine-grained, the distribution of each language group appears normal: the logographic group has a higher mean and a larger standard deviation (the top of the curve is somewhat flattened). The deep orthography group has a similarly shaped distribution, but with a lower mean (the lowest of the three). The shallow orthography group sits in the middle, with a mean slightly above that of the deep orthography group, and the smallest standard deviation.
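The skewness claims above are eyeballed from histograms; a sample skewness statistic is one way to quantify them. A minimal sketch on synthetic scores (not the PISA data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic score samples (illustrative only):
symmetric = rng.normal(500, 100, 4000)          # roughly normal scores
left_skewed = 700 - rng.gamma(4.0, 50.0, 4000)  # long tail toward low scores

print(f"skew(symmetric)   = {stats.skew(symmetric):+.2f}")
print(f"skew(left_skewed) = {stats.skew(left_skewed):+.2f}")
```

On the real data one could run `stats.skew(...)` on each group's `Plausible_value_in_mathematics` (after dropping NaNs) to put a number on the visual impression.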
# reading values vs language type (normalized)
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_reading', norm_hist=True, bins=120);
g.add_legend();
These curves all look slightly left-skewed. The logographic group scores better in reading as well; the deep orthography group tends to have more scores below 400 (level 2), but past that level its distribution matches the shallow one.
# ESCS by language type
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.distplot, 'Index_of_economic_social_and_cultural_status', norm_hist=True, bins=120);
g.add_legend();
ESCS and math (or reading) performance are positively correlated, BUT a higher ESCS is not the reason for the better performance of the logographic group, since that group generally has a worse ESCS.
There is one area, Macao-China, for which the PISA 2012 dataset records many tests administered in Chinese, and a smaller number in English.
print(exploration_df1[exploration_df1.Country=='Macao-China'].Language_of_the_test.value_counts())
within_macao = exploration_df1[exploration_df1.Country=='Macao-China'].copy()
within_macao.Language_at_home.value_counts()
# reading values score by language at home
g = sb.FacetGrid(data=within_macao, hue='Language_at_home', height=5, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_reading', norm_hist=True, bins=20);
g.add_legend();
# math values by language at home
g = sb.FacetGrid(data=within_macao, hue='Language_at_home', height=5, aspect=1.5)
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True, bins=20);
g.add_legend();
The picture for Macao-China has not changed. The number of tests in English is not large (51); however, the Mandarin-speaking students here are not many more (70), and the distribution of their scores follows the larger-sample Cantonese distribution (4056 students).
If anything, since the distribution of math scores for the English-language students differs from that of the larger group (it looks bimodal, like their reading values), we can suspect that language is not a barrier per se.
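Whether 51 scores could plausibly come from the same distribution as 4056 others is the kind of question a two-sample Kolmogorov-Smirnov test addresses. A hedged sketch on synthetic data with those sample sizes (the means and spread are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
large = rng.normal(525, 80, 4056)        # stand-in for the large Cantonese sample
small_same = rng.normal(525, 80, 51)     # small sample, same distribution
small_shifted = rng.normal(450, 80, 51)  # small sample, genuinely lower scores

p_same = stats.ks_2samp(large, small_same).pvalue
p_shift = stats.ks_2samp(large, small_shifted).pvalue
print(f"same distribution: p = {p_same:.3f}; shifted: p = {p_shift:.2e}")
```

With only 51 points the test has limited power, so a non-significant result would not prove the distributions match, only that the small sample is compatible with the large one.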
NOTE: unfortunately the school system in Macao-China "does not have a single centralised set of standards or curriculum. Individual schools follow different educational models, including Chinese, Portuguese, Hong Kong, and British systems." Nonetheless, "the majority of the schools in Macau are grammar schools, which offer language learning, mathematics, science subjects, social studies, etc. to the pupils", so it is reasonable to think that these data do not come from vocational school students (vocational schooling there starts after age 15 anyhow).
Math levels vs reading levels: There is a clear positive correlation between test result in reading and test results in mathematics.
Math levels by ESCS: if we plot the math levels by the ESCS index, we see the curve start slightly right-skewed for math level 'below 1', become normal, and then bend the other way, with a mean ESCS index that increases as we move up the math levels. By level 6 it is definitely left-skewed.
There appears to be a correlation here, and it is a bit sad: very low levels in mathematics can be "achieved" by students from every socioeconomic and cultural background (although, in fact, students at the lowest ESCS level are not the worst achievers). On the other hand, students living in the most disadvantaged conditions are absent from the top mathematics scores.
Math levels by gender (normalized): girls seem to score a bit worse overall: their mean is slightly lower, and their standard deviation is narrower.
Math values by language type: students using a logographic language perform better. Looking at performance levels, their distribution seems left-skewed. Looking at the values, which are more fine-grained, the distribution of each language group appears normal.
The picture for Macao-China has not changed. The number of tests in English is not large (51); however, the Mandarin-speaking students here are not many more (70), and the distribution of their scores follows the larger-sample Cantonese distribution (4056 students).
If anything, since the distribution of math scores for the English-language students differs from that of the larger group (it looks bimodal, like their reading values), we can suspect that language is not a barrier per se.
Contrary to math levels by gender, reading levels by gender suggest that girls outperform boys in this task.
reading and ESCS: The relationship between reading and ESCS is similar to math and ESCS.
# relationship between ESCS and math values by language type
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.regplot, 'Index_of_economic_social_and_cultural_status', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.01})
# .add_legend() does not work right here (the legend colors depend on the scatter alpha value)
import matplotlib
name_to_color = {
'deep orthography': sb.color_palette()[0],
'shallow orthography': sb.color_palette()[1],
'logographic': sb.color_palette()[2],
}
patches = [matplotlib.patches.Patch(color=v, label=k) for k,v in name_to_color.items()]
plt.legend(handles=patches);
It looks like the ESCS level matters more in countries with a deep orthography language. Shallow orthography and logographic languages show a similar correlation between ESCS and math scores, but some other factor places the logographic group higher.
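The "ESCS matters more for deep orthography" reading amounts to a difference in per-group regression slopes, which can be checked numerically. A minimal sketch with `np.polyfit` on synthetic groups (the slopes and intercepts below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

def make_group(slope, intercept, n=3000):
    """Synthetic ESCS -> math-score relationship for one language group."""
    escs = rng.normal(0, 1, n)
    score = intercept + slope * escs + rng.normal(0, 80, n)
    return escs, score

groups = {
    'deep orthography':    make_group(slope=45, intercept=480),
    'shallow orthography': make_group(slope=30, intercept=495),
    'logographic':         make_group(slope=30, intercept=545),
}

# Degree-1 fit per group: the first coefficient is the slope.
fits = {name: np.polyfit(x, y, 1) for name, (x, y) in groups.items()}
for name, (slope, intercept) in fits.items():
    print(f"{name:20s} slope = {slope:5.1f}  intercept = {intercept:5.1f}")
```

A steeper slope means ESCS moves the expected score more; a higher intercept at the same slope is the "other factor" lifting a group, which matches the picture described above.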
# relationship between reading and math values by language type
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', height=5, aspect=1.5)
g.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.01})
# .add_legend() does not work right
import matplotlib
name_to_color = {
'deep orthography': sb.color_palette()[0],
'shallow orthography': sb.color_palette()[1],
'logographic': sb.color_palette()[2],
}
patches = [matplotlib.patches.Patch(color=v, label=k) for k,v in name_to_color.items()]
plt.legend(handles=patches);
(all language types in the same graph, because it is easier to compare the angles of the regression lines)
It looks like the correlation between reading performance and math performance is stronger for the logographic languages: for the same reading performance on the PISA test, students in the logographic group perform better in math than students in the other groups.
Of course, this could very well depend on other causes, such as a schooling system that works harder on mathematics.
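The "stronger correlation" impression could be put into numbers with a per-group Pearson r. A sketch on synthetic data (the group names mirror the notebook's columns, but the values are invented; less noise yields a tighter correlation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic reading/math pairs per language group.
frames = []
for group, noise in [('deep orthography', 70),
                     ('shallow orthography', 60),
                     ('logographic', 40)]:
    reading = rng.normal(500, 90, 2000)
    math = 0.9 * reading + 50 + rng.normal(0, noise, 2000)
    frames.append(pd.DataFrame({'Language_type': group,
                                'Plausible_value_in_reading': reading,
                                'Plausible_value_in_mathematics': math}))
df = pd.concat(frames, ignore_index=True)

# Pearson r between reading and math within each group.
r_by_group = df.groupby('Language_type').apply(
    lambda g: g['Plausible_value_in_reading']
               .corr(g['Plausible_value_in_mathematics']))
print(r_by_group.round(3))
```

On the real frame the same `groupby(...).apply(...)` would give one r per `Language_type`, which is a more defensible basis for the claim than comparing regression-line angles by eye.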
# relationship between reading and math values by language type, by country
g = sb.FacetGrid(data=exploration_df1, hue='Language_type', col='Country',
col_wrap=3, xlim=(0,1000), ylim=(0,1000), legend_out=True)
g.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.1});
name_to_color = {
'deep orthography': sb.color_palette()[0],
'shallow orthography': sb.color_palette()[1],
'logographic': sb.color_palette()[2],
}
patches = [matplotlib.patches.Patch(color=v, label=k) for k,v in name_to_color.items()]
plt.legend(handles=patches, loc=6, bbox_to_anchor=(1,0.5));
There are a few interesting countries with two language types each:
- Belgium, Switzerland, Luxembourg: deep and shallow orthography
- Hong Kong-China, Macao-China: deep orthography and logographic
# focus on those countries
exploration_df1_subset = exploration_df1[exploration_df1.Country.isin(['Belgium', 'Switzerland', 'Luxembourg', 'Hong Kong-China', 'Macao-China'])]
# plot them
g = sb.FacetGrid(data=exploration_df1_subset, hue='Language_type', col='Country',
col_wrap=3, xlim=(0,1000), ylim=(0,1000), legend_out=True)
g.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics', scatter_kws={'alpha':0.1});
name_to_color = {
'deep orthography': sb.color_palette()[0],
'shallow orthography': sb.color_palette()[1],
'logographic': sb.color_palette()[2],
}
patches = [matplotlib.patches.Patch(color=v, label=k) for k,v in name_to_color.items()]
plt.legend(handles=patches, loc=6, bbox_to_anchor=(1,0.5));
# number of rows each
exploration_df1_subset.groupby('Country').Language_of_the_test.value_counts()
I have already seen that Macao has very few English (blue) datapoints. Hong Kong has even fewer (13), so I would not dare comment on that trendline.
In Belgium and Luxembourg the regression lines converge. [Luxembourg](https://omniglot.com/writing/luxembourgish.htm)'s schooling system is multilingual, and students from different language backgrounds study together. Belgium's schooling system is different, as it is divided by linguistic group, but it has a common base of competence levels that every student has to reach at the end of each cycle. The Swiss system is organized differently: the Federation only defines the compulsory years of education, while each Canton then has the authority to organize its own system. The shallow-orthography group here contains both German and Italian. The German Cantons and the Italian-speaking Ticino may be interesting to compare (even if there are far fewer Italian datapoints: 304 vs 4983).
# sample the German and French data for Switzerland
np.random.seed(55)
switzerland_df = exploration_df1.query('Country=="Switzerland"')
fractions_to_drop = {'French':.9, 'German':.95}
for lang, frac in fractions_to_drop.items():
switzerland_df = switzerland_df.drop(switzerland_df.loc[switzerland_df.Language_of_the_test == lang].sample(frac=frac).index)
switzerland_df.Language_of_the_test.value_counts()
# ESCS by language of the test in Switzerland
g = sb.FacetGrid(data=exploration_df1.query('Country=="Switzerland"'),
hue='Language_of_the_test', palette='icefire', height=6, aspect=1.5)
g.map(sb.distplot, 'Index_of_economic_social_and_cultural_status', hist=False, bins=120);
g.add_legend();
The ESCS index distribution is the same across the different linguistic areas.
# ESCS, math, language type
# plot the full data and the sampled ones, to be sure the trendlines stay the same
g1 = sb.FacetGrid(data=exploration_df1.query('Country=="Switzerland"'), hue='Language_of_the_test',
palette='icefire', xlim=(-6,5), ylim=(100,800), height=5, aspect=1)
g1.map(sb.regplot, 'Index_of_economic_social_and_cultural_status', 'Plausible_value_in_mathematics', scatter=False)
g1.add_legend()
#g2 = sb.FacetGrid(data=switzerland_df, hue='Language_of_the_test', palette='icefire', height=5, aspect=1)
#g2.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics')
#g2.add_legend()
This graph seems to suggest that the schooling system matters more than the language type. Here we have, in order, a shallow-orthography member (Italian) that, for high reading performance, scores worse in mathematics than a deep-orthography language (French), which in turn scores worse than another shallow-orthography language (German).
g2 = sb.FacetGrid(data=exploration_df1_subset.query('Country=="Switzerland"'), hue='Language_of_the_test', col='ESCS_levels',
col_wrap=2, palette='icefire', height=5, aspect=1, xlim=(100,900), ylim=(100,900))
g2.map(sb.regplot, 'Plausible_value_in_reading', 'Plausible_value_in_mathematics')
g2.add_legend();
To have a better look at the role of the educational system, since it seems most relevant, we can observe the distributions for the English-speaking countries.
# get the subset of english speaking Countries
english_df = exploration_df1.query('Language_of_the_test=="English"')
english_df.groupby('Country').Plausible_value_in_mathematics.describe()
# have a look at their math values distribution
g = sb.FacetGrid(data=english_df, row='Country', height=1, aspect=3, xlim=(100,900), ylim=(0,0.01))
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True, bins=100)
I will keep only the countries with more than 1000 rows, and then add the ESCS level (cutting the two lowest levels and the top one, which are almost empty).
# select Countries with more than 1000 rows
english_count = english_df.Country.value_counts().to_dict().items()
english_df = english_df[english_df.Country.isin([key for key, val in english_count if val > 1000])]
# get the category values to copy and paste
english_df.ESCS_levels.dtype
# drop the lines in the two lowest ESCS levels and in the top one
ESCS_drop = ['-5.32 to -4.38', '-4.38 to -3.44', '2.18 to 3.12']
english_df = english_df.drop(english_df[english_df.ESCS_levels.isin(ESCS_drop)].index)
# Modify the category
ESCS_ordered2 = ['-3.44 to -2.51', '-2.51 to -1.57', '-1.57 to -0.63', '-0.63 to 0.31', '0.31 to 1.24', '1.24 to 2.18']
ESCS_middle_cat = pd.api.types.CategoricalDtype(ordered=True, categories=ESCS_ordered2)
english_df.ESCS_levels = english_df.ESCS_levels.astype(ESCS_middle_cat)
# plot the math scores by Country divided by ESCS level
g = sb.FacetGrid(data=english_df, col='ESCS_levels', row='Country', height=1, aspect=1.8, xlim=(100,900), ylim=(0,0.01))
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True, bins=100)
g.add_legend();
# plot the math scores by Country divided by ESCS level (English speaking countries)
plt.figure(figsize=(15,12))
g = sb.pointplot(data=english_df, x="ESCS_levels", y="Plausible_value_in_mathematics",
hue="Country", dodge=.3, palette='bright')
# collect the logographic countries
logographic_df = exploration_df1.query('Language_type=="logographic"')
logographic_df.groupby('Country').Language_of_the_test.value_counts()
# drop the lines in the two lowest ESCS levels and in the top one
# list is ESCS_drop = ['-5.32 to -4.38', '-4.38 to -3.44', '2.18 to 3.12']
logographic_df = logographic_df.drop(logographic_df[logographic_df.ESCS_levels.isin(ESCS_drop)].index)
# Modify the category
# cat is ESCS_ordered2 = ['-3.44 to -2.51', '-2.51 to -1.57', '-1.57 to -0.63', '-0.63 to 0.31', '0.31 to 1.24', '1.24 to 2.18']
# ESCS_middle_cat = pd.api.types.CategoricalDtype(ordered=True, categories=ESCS_ordered2)
logographic_df.ESCS_levels = logographic_df.ESCS_levels.astype(ESCS_middle_cat)
# plot the math scores by logographic language (=Country) divided by ESCS level
plt.figure(figsize=(10,8))
g = sb.pointplot(data=logographic_df, x="ESCS_levels", y="Plausible_value_in_mathematics", hue="Country")
# plot the math scores by Country divided by ESCS level
g = sb.FacetGrid(data=logographic_df, col='ESCS_levels', row='Country', height=1, aspect=2, xlim=(100,900), ylim=(0,0.01))
g.map(sb.distplot, 'Plausible_value_in_mathematics', norm_hist=True, bins=100)
g.add_legend();
# just to have them all
g = sb.FacetGrid(data=exploration_df1, hue="Country", col='Language_type', palette='bright', height=6, aspect=1.5)
g.map(sb.pointplot, "ESCS_levels", "Plausible_value_in_mathematics",
dodge=.3)
# just to have them all
sb.catplot(data=exploration_df1, x='ESCS_levels', y='Plausible_value_in_mathematics',
hue='Language_type', kind='point');
Relationship between ESCS and math values by language type: it looks like the ESCS level matters more in countries with a deep orthography language. Shallow orthography and logographic languages show a similar correlation between ESCS and math scores, but some other factor places the logographic group higher.
Relationship between reading and math values by language type, by country: there are a few interesting countries with two language types each: Belgium, Switzerland, and Luxembourg (deep and shallow orthography); Hong Kong-China and Macao-China (deep orthography and logographic). Macao has very few English (blue) datapoints. Hong Kong has even fewer (13), so I would not dare comment on that trendline.
In Belgium and Luxembourg the regression lines converge. [Luxembourg](https://omniglot.com/writing/luxembourgish.htm)'s schooling system is multilingual, and students from different language backgrounds study together. Belgium's schooling system is different, as it is divided by linguistic group, but it has a common base of competence levels that every student has to reach at the end of each cycle.
The Swiss system is organized differently: the Federation only defines the compulsory years of education, while each Canton then has the authority to organize its own system. The shallow-orthography group here contains both German and Italian. The German Cantons and the Italian-speaking Ticino may be interesting to compare (even if there are far fewer Italian datapoints: 304 vs 4983).
Focus on Switzerland:
Focus on English:
# save the dataset needed in the slideshow
exploration_df1.to_csv('df_slideshow1.csv', index=False, encoding='utf-8')